Enhancing Security in Social Networks through Machine Learning: Detecting and Mitigating Sybil Attacks with SybilSocNet

Cárdenas-Haro, José Antonio; Salem, Mohamed; Aldaco-Gastélum, Abraham N.; López-Avitia, Roberto; Dawson, Maurice

doi:10.3390/a17100442


by José Antonio Cárdenas-Haro ^*,†, Mohamed Salem ^†, Abraham N. Aldaco-Gastélum ^†, Roberto López-Avitia and Maurice Dawson

School of Computer Sciences, Western Illinois University, 1 University Cir, Macomb, IL 61455, USA

^* Author to whom correspondence should be addressed.

^† These authors contributed equally to this work.

Algorithms 2024, 17(10), 442; https://doi.org/10.3390/a17100442

Submission received: 30 July 2024 / Revised: 9 September 2024 / Accepted: 18 September 2024 / Published: 3 October 2024

(This article belongs to the Section Evolutionary Algorithms and Machine Learning)


Abstract:

This study contributes to Sybil node detection in online social networks (OSNs). As major communication platforms, OSNs must be guarded against malicious activity. A thorough literature review identified various algorithms for detecting and preventing Sybil attacks. A further exploration of distinct reputation systems and their practical applications led this study to adopt machine learning algorithms, namely K-nearest neighbors (KNN), support vector machine (SVM), and random forest, as part of our SybilSocNet. This study details the data-cleansing process applied to the employed dataset, in which dataset partitioning reduces the computational demands of training the machine learning algorithms. We then explain and analyze the experiments we conducted and compare their results. The experiments demonstrated the algorithm’s ability to detect Sybil nodes in OSNs (99.9% accuracy with SVM, 99.6% with random forest, and 97% with KNN), and we propose future research opportunities.

Keywords:

machine learning; Sybil attacks; online social networks; cybersecurity; random forest; support vector machine; KNN

1. Introduction

1.1. Background and Motivation

Wi-Fi and various peer-to-peer (P2P) networks are now integral to our daily lives, so safeguarding such networks and their sensitive data from cyberattacks is paramount. Sybil attacks have emerged as a prominent concern (Saxena et al. [1]): they exploit the vulnerabilities of distributed and P2P networks to infiltrate and compromise them. These attacks pose serious risks to applications ranging from military operations to social networks and everyday activities like connected commuting [2,3,4,5]. Given their increasing sophistication and prevalence, Sybil attacks are a significant threat to the integrity and security of many distributed systems, such as online voting systems, social media platforms, and blockchain networks. In all these cases, the attacker creates and controls multiple fake nodes or identities within a network, with the aim of compromising user privacy, manipulating voting outcomes, or undermining the consensus mechanisms of the system. The consequences for a network that falls victim to a Sybil attack go beyond the immediate compromise: the fake identities empower the attackers to manipulate the network and to identify further vulnerabilities [1]. This study’s comprehensive literature review focused on the growing challenges posed by Sybil attacks and on prevention strategies, with the aim of contributing a novel machine learning technique to detect Sybil attacks [6].

The primary motivations for continued research in Sybil node detection and attack prevention are therefore preserving the integrity, reliability, and trustworthiness of distributed systems. Sybil attack resistance is a challenging goal that often requires significant compromises in terms of complexity, performance, and cost [7]. Developing effective detection techniques for the constantly changing configurations and dynamics of social networks is important but not easy. The aim is to protect the core functionalities of the system and ensure that all nodes are legitimate while still preserving the users’ anonymity. Otherwise, Sybil attacks may have severe social and economic consequences. Sybil attackers can influence a voting process to favor their interests, leading to an unfair distribution of rewards or to control over a network. They can also track user behavior, spread misinformation, and influence public opinion [8,9]. Effective detection and prevention techniques can help safeguard user privacy and protect against identity theft. Detection algorithms for Sybil nodes are based on network topology, patterns of behavior, and other factors that depend on the specific case. There are also techniques to prevent the creation and propagation of Sybil nodes, such as reputation systems, cryptographic techniques, and game-theoretic approaches [10].

1.2. Problem Statement and Research Question

Previous research [11,12,13] has leveraged machine learning (ML) algorithms, such as extreme learning machine (ELM) and support vector machine (SVM), to detect Sybil attacks. In online social networks (OSNs), ML-based detection methods are promising but remain under-researched. Hence, this study built upon the works of Du et al. [14] and other scholars, e.g., [15,16,17,18], to delve into Sybil attacks within OSNs. Sybil attacks manifest across various network types, e.g., wireless sensor networks (WSNs), mobile ad hoc networks (MANETs), and OSNs [16]. WSNs consist of interconnected nodes equipped with sensors for data processing and analysis. MANETs, featuring mobile nodes without fixed infrastructure, find applications in vehicular networks. OSNs, encompassing platforms such as Facebook and Twitter, offer attackers opportunities to introduce malicious nodes and undermine the network’s integrity [19]. The literature reports different patterns of Sybil attacks, such as direct and indirect attacks and stolen or fabricated identities [3,20]. In direct attacks, malicious nodes mimic legitimate ones and communicate directly with authentic nodes. In indirect attacks, communications between legitimate nodes are manipulated, which causes confusion and alters the information flow. In stolen-identity or fabricated-identity attacks, malicious nodes claim legitimate credentials and sometimes replace authentic nodes without being detected [21,22,23]. Amid escalating Sybil attacks within OSNs, scholars are studying how to protect user identities from rampant fake identities (bots) [21,22]. Notably, networks can be exploited as propaganda tools if Sybil attacks persist. To address this challenge, a paramount motivation of this study was to develop an innovative algorithm named “SybilSocNet”, designed to identify and counteract Sybil nodes within OSNs. The proposed algorithm harnesses ML methodologies such as K-nearest neighbors (KNN), support vector machine (SVM), and random forest to optimize detection accuracy by extracting useful information from a large amount of raw data.

The research question in this case was as follows: How can the SybilSocNet algorithm, using machine learning techniques such as KNN, SVM, and random forest, effectively detect and mitigate Sybil attacks in online social networks (OSNs) by efficiently processing and analyzing large-scale raw data into informative matrices?

1.3. This Paper’s Significance and Contribution

As technology becomes increasingly intertwined with our daily lives, it is imperative to protect the integrity of networks. This study highlights the importance of addressing Sybil attacks and their potential consequences. The proposed algorithm, powered by robust machine learning techniques, contributes to global efforts to enhance network security in the face of ever-evolving cyber threats. Our research aimed to investigate how the SybilSocNet algorithm can effectively identify and mitigate Sybil attacks in online social networks (OSNs) by leveraging a combination of machine learning techniques, including K-nearest neighbors (KNN), support vector machine (SVM), and random forest, via the extraction of patterns from raw data preprocessed as matrices. Regarding the remainder of this article, Section 2 provides a comprehensive review of the literature on cybersecurity, network security, Sybil attacks, machine learning, social network security, and algorithmic and machine-learning-based detection.

Section 3 outlines the details of our algorithm. We begin by preprocessing the raw data to create matrices containing relevant information that can be used by the machine learning algorithms to detect patterns. To manage the large dataset efficiently, we divide the raw data into seventy-one smaller matrices rather than working with a single, massive matrix, which would require significant computational resources. This approach allows us to process the data using workstations instead of a supercomputer. Interestingly, the combined size of these seventy-one smaller matrices is still substantially smaller than the size of a single, large matrix, without compromising the valuable information needed for the machine learning methods.

Section 3 explains the research methodology applied in this study, followed by Section 4, presenting an analysis of the collected secondary open-source data. Section 5 discusses the findings, as well as reports on implications and limitations, and concludes with some open questions for future research.

2. Theoretical Foundation and Sybil Attack Landscape

2.1. Sybil Attacks: Concepts, Core Characteristics, Varieties, and Impact on Network Integrity

Scholars are concerned about the rapid escalation of cybercrime, with unprecedented incidents. Modern technologies facilitate swift data transfer through various communication protocols, and wireless technologies, e.g., 5G and WLAN, are the key communication players, so their security must be assured [24]. Without robust security measures, numerous essential applications and devices, such as mobile phones, satellites, access points, and wireless internet, may face vulnerabilities, such as data breaches [25]. Researchers are keen to counter the security attacks on such technologies. Wireless technology is an attractive target because of its extensive usage across a wide spectrum of applications: the Pew Research Center (2023) reports that about 97% of Americans own mobile phones, 85% of which are smartphones, with the reported American population being roughly 307 million [26]. According to Vincent [27], the count of mobile phones in 2010 was around 4.5 billion, a sizable target pool for attackers. As per NBC News [26], three-quarters of Americans possess laptops or desktops, which are likewise sizable, vulnerable targets. Considering the sensitivity of the data involved, an initiative to safeguard such a large pool of targets is required, particularly as these devices use wireless sensors in modern applications, which raises security concerns further. One drawback of wireless communication systems is that data and media are shared over networks without restriction, which makes them susceptible to signal interception and allows potential attackers to spoof crucial information for launching attacks. When wireless networks are exploited, the information necessary for a successful attack, e.g., an identity-based attack, is exposed. The literature sheds light on two simple yet destructive identity-based attacks: spoofing and Sybil attacks.
In a spoofing attack, attackers monitor a targeted network, gather legitimate user information, e.g., IP or MAC addresses, and pose as legitimate users to gain access to unauthorized services and data. In a Sybil attack, a malicious node infiltrates a network and forms multiple fake nodes that claim to be legitimate, deceiving neighboring nodes by forwarding inappropriate information. By spoofing legitimate information, attackers gradually introduce new fake nodes and assume control over time, which enables unauthorized data access and the launch of various attacks, e.g., denial-of-service attacks and data manipulation [28]. A Sybil attack targets P2P networks by creating and controlling multiple fake identities, allowing an attacker to launch deceptive activities. With multiple identities under their control, attackers can concurrently deploy multiple nodes and mislead the network into granting unauthorized data access; examples include a reviewer fabricating numerous product reviews on Amazon to fake positive customer feedback, or someone creating multiple accounts on Twitter to influence users. Sensor networks also suffer from Sybil attacks when attackers pose as legitimate nodes, which can cascade into deceptively obtaining precise user information through social engineering or network spoofing. Such private information fuels Sybil attacks, eventually granting attackers undue network influence [29].

According to Douceur [8], Sybil nodes illicitly claim multiple identities or fake IDs. Figure 1 depicts a Sybil attack, where yellow nodes denote legitimate nodes and red nodes represent Sybil nodes. During an attack, a Sybil node misleads the victim network by posing as legitimate, consequently redirecting traffic. For instance, node (A) poses as (H) to node (L) and as (J) to node (K), so that traffic is routed to node (A), which then forwards, alters, or blocks messages. The same tactic extends to the other red nodes. Attackers exploit victim nodes to wreak havoc, breach data, or lurk and launch attacks, yielding diverse outcomes. Such actions lead to malfunctioning networks with halted functions and to data breaches involving spoofed traffic between legitimate nodes. They also leave manipulated networks that serve the attackers’ wishes, e.g., altered weather predictions that declare specific regions disaster zones [30].

2.2. Targeted Networks and Vulnerabilities

Sybil attacks target diverse networks, of which we consider four types. (1) OSN platforms such as Twitter connect millions worldwide; an attack, as publicized by NBC [26], can impact masses and even nations. (2) Online store platforms such as Amazon can end up hosting fake product reviews that inflate a product’s image to drive sales; such tactics often involve payment for generating fake reviews. (3) Vehicular ad hoc networks (VANETs) are a subset of mobile ad hoc networks (MANETs) and enable vehicular communication protocols such as vehicle-to-vehicle (V2V), vehicle-to-roadside (V2R), and vehicle-to-infrastructure (V2I) connections. Sybil attacks on VANETs vary and include impersonation through fabricated MAC addresses or certificates, with attackers infiltrating to breach sensitive data or disrupt vehicle traffic. (4) Wireless sensor networks (WSNs) include weather-related networks for environment tracking, disaster management, and industrial automation. Here, Sybil attacks appear as nodes that simulate sensors to transmit false data, overload networks with malware, or deploy data spoofing or resource exhaustion tactics.

2.3. Unleashing Defense Strategies: Safeguarding against Sybil Attacks

Various techniques protect against Sybil attacks by reducing their likelihood. However, no efficient solution can fully prevent such a threat [15,31,32,33]. Rahbari et al. [4] suggested a cryptography-based approach that incurs overheads; Almesaeed et al. [17] proposed an approach with controllable overheads. In our view, an optimal approach should balance security and performance, as also elaborated by Rahbari et al. [4]. It is essential to assess the widely recognized techniques for countering Sybil attacks, which are categorized by protection, detection, and mitigation. The frequently cited approaches can be grouped into six types. (1) Certification is a recognized mitigation approach that certifies nodes through authorized identification [22]. Douceur cited VeriSign as an example of a certification system that foils Sybil attacks, though challenges emerge without central certification. (2) Resource testing applies to networks with known resources, such as sensor arrays [34], where nodes’ physical resources validate legitimacy against commonly known resource allocation statistics [3]. Consider a network of temperature sensors where all sensors possess identical computational power, memory, and battery capacity. In this mitigation approach, nodes’ physical resources are examined to ascertain whether they align with the attributes of legitimate nodes. While nodes execute tasks, reputation systems assess whether each node is legitimate or Sybil. Lee et al. [35] used the resource-testing approach to detect Sybil nodes in wireless networks; their system determines whether a node is Sybil based on how it responds to a request. Newsome et al. (2004) showed how attackers use a single physical device to introduce multiple nodes into a network. (3) In incentive-based detection, Margolin et al. [13] proposed an economic detection protocol for Sybil attacks, in which an attacker is financially incentivized for two or more controlled identities.
(4) Location/position and timing verification detects a Sybil attack based on node location. To demonstrate legitimacy, every node must be in a specific location. The approach detects Sybil attackers by discovering whether multiple nodes are in exactly the same location or move in a coordinated or synchronous pattern [18]. (5) In random key pre-distribution, nodes receive unique identifying keys before deployment [14]; this approach is well suited to wireless sensor networks (WSNs) because of their limited computational and energy capabilities. Eschenauer et al. [36] and Dhamodharan and Vayanaperumal [37] introduced key management schemes specifically designed for distributed sensor networks. (6) In privilege attenuation, Fong [12] counters Sybil attacks in Facebook-style social network systems (FSNSs) with an approach aligned with Denning’s principle of privilege attenuation, which limits Sybil attacks through measures such as access restrictions among friends. Sybil detection algorithms are deployed to detect whether a Sybil attack is occurring. They determine nodes’ legitimacy by differentiating malicious from legitimate nodes. Several scholars [2,31,32,38] have explored techniques for preventing and mitigating Sybil attacks. Quevedo et al. [39] suggested the SyDVELM mechanism, which employs the extreme learning machine (ELM) algorithm to detect Sybil attacks in VANETs. Yu et al. [40] proposed the SybilLimit protocol to counter Sybil attacks in OSNs. Patel et al. [16] examined Sybil detection algorithms for wireless sensor networks, where nodes function as sensors. WSNs are often deployed for monitoring purposes, e.g., in military applications, and their unattended nature renders them susceptible to Sybil attacks and other cyber threats. Consider, for example, a network of sensors that measures air speed and temperature to predict future weather.
In this scenario, the sensor network’s communication becomes compromised when a new sensor is introduced under the guise of legitimacy but is, in fact, malicious. Such deceptive nodes can disseminate fake results, distort weather forecasts, and propagate misleading data to other nodes. Furthermore, malicious nodes might persistently introduce seemingly legitimate nodes by mimicking the behavior of genuine ones. In both scenarios, successful infiltration by a node can profoundly disrupt the network’s functionality. In critical infrastructure, e.g., traffic systems, such a disruption can result in city-wide chaos and severe accidents. Similarly, a swarm of compromised military drones could result in significant casualties. To safeguard against such threats, constant network monitoring for Sybil attacks is crucial. Hence, this study aimed to comprehend the Sybil attack detection techniques within P2P networks.

2.4. Sybil Detection Algorithms

In WSNs, Dhamodharan et al. [37] formulated a Sybil detection algorithm: message authentication with a Random Password Generation (RPG) algorithm to identify and neutralize Sybil attacks. RPG generates routing tables for network nodes, comparing node details during message transmission to identify unauthorized nodes. The Compare and Match-Position Verification Method (CAM-PVM) detects Sybil attacks. Data are rerouted through verified nodes, omitting known Sybil nodes. However, due to the high computational demands, Dhamodharan et al. [37] advocate prevention prioritization over detection. CAM-PVM is selectively employed when data-sending nodes suspect a recipient, ensuring efficiency.

Regarding Sybil attack detection in mobile ad hoc networks (MANETs), Abbas et al. [41] introduced a detection scheme for MANETs, which are versatile and widely applied across various fields. MANETs self-organize into versatile topologies driven by application needs, with wireless nodes that connect and disconnect dynamically based on bandwidth. However, such wireless connectivity offers reduced capacity compared with wired connections. Security is challenging due to autonomy and the absence of centralized authority, rendering MANETs vulnerable to diverse cyber threats. All nodes in MANETs function as both hosts and routers, enhancing their configurations while compromising their security. While wireless connectivity boosts data transfer rates, latency is introduced due to constrained resources like battery-powered small units. Despite these complexities, the unique properties of MANETs have led to their diverse applications.

Newsome et al. [42] examined security threats in sensor networks. Due to the cost-efficiency requirement for mass production, sensor nodes often lack advanced hardware, computational power, and storage space, limitations that hinder the use of robust cryptography algorithms, leaving networks vulnerable to threats like Sybil attacks. For instance, in a weather prediction ad hoc network, a Sybil node could manipulate forecasts to predict extreme conditions, or, in a military application’s MANET, a Sybil node breaching the network might endanger lives.

Abbas et al. [41] proposed a Received Signal Strength (RSS)-based localization algorithm as a solution for Sybil attacks. Such an algorithm was leveraged by Chen et al. [43] for the identification of Sybil attacks within sensor networks. Chen et al. recognized its key role as a promising strategy for detecting Sybil attacks in MANETs. They treated RSS as a measure of the strength of the signal received from a specific node at a given location. The RSS value varies depending on the location of the signal’s source node. According to Abbas et al. [41], one way to distinguish between Sybil and legitimate nodes is by analyzing the nodes’ entry and exit behaviors within the network. RSS is the measurement of the signal strength received by a receptor, typically gauged at the receiver’s antenna. Several factors influence RSS, e.g., transmitter–receiver distance, transmission medium, and receiver signal power. RSS is crucial in detecting and estimating the geographical location of a transmitter. Abbas et al. [41] examined network nodes as they joined and exited a network, assessing proximity and signal strength to distinguish legitimate from Sybil nodes. When a new Sybil node enters a network, it is expected to exhibit a high RSS. Conversely, legitimate nodes tend to join a network upon receiving a signal, which results in a lower RSS value during network entry. This discrepancy arises from the fact that Sybil nodes are spawned by nodes already in the network.

In contrast, legitimate nodes, drawn by proximity, promptly join the network with weaker signals, leaning toward a lower RSS value. The detection mechanism stores the communication history between neighboring nodes in the individual nodes, so that each node maintains a table covering its entire communication history. This table, which includes the signal strength measurements for the surrounding nodes, is the basis on which each node assesses whether a neighbor is a Sybil node, and it enables the algorithm to classify the investigated node. The data in these tables are accumulated during direct node-to-node communication or even when an indirect node is engaged. Abbas et al. conducted two experiments to discern distinctive behaviors between Sybil and legitimate nodes. In the first experiment, they distinguished the behaviors of Sybil and legitimate nodes by monitoring a node’s RSS over time. A legitimate node typically enters another node’s range gradually, and its RSS strengthens as it comes closer to the intended connections. Conversely, as it moves farther away, the RSS weakens. Figure 2 depicts the progression of RSS values over time for both Sybil and legitimate nodes. For a legitimate node naturally entering and exiting a network, the plot is a bell-shaped RSS curve (Figure 2A). A Sybil node exhibits a higher RSS than a legitimate node (Figure 2B): the plot is a constant line followed by a negatively sloping curve, signifying that the node enters with consistently high RSS values that diminish as it departs the network.

In the second experiment, the speed at which nodes entered the network significantly influenced the choice of the threshold RSS value. It is crucial to ascertain how far a node can travel into another node’s range before being detected. To investigate the relationship between node speed, penetration distance, and RSS value, Abbas et al. [41] experimented with two nodes, X and Y, where Y moved away from X at varying speeds while RSS values were recorded. Their findings revealed that a higher speed allows a node to penetrate farther into the range of other nodes while undetected. These insights established the system threshold, under the assumption that no node could surpass a predetermined maximum speed.
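The first-contact idea described above can be illustrated with a minimal Python sketch. It is not the scheme of Abbas et al. [41]: the threshold value, the function names, and the sample RSS histories are all hypothetical, and the only rule modeled is that a node first heard at an already high RSS is treated as a suspect.

```python
from typing import Dict, List

# Hypothetical threshold in dBm; the source does not state a value.
RSS_THRESHOLD_DBM = -60.0


def first_rss(history: List[float]) -> float:
    """Return the first RSS reading recorded for a neighbor."""
    if not history:
        raise ValueError("empty RSS history")
    return history[0]


def flag_sybil_suspects(table: Dict[str, List[float]],
                        threshold: float = RSS_THRESHOLD_DBM) -> Dict[str, bool]:
    """Flag neighbors whose first recorded RSS already exceeds the threshold.

    A node that moves into range should first be heard weakly; a node that
    appears at high strength is treated as a suspect.
    """
    return {node: first_rss(h) > threshold for node, h in table.items() if h}


# Hypothetical per-neighbor RSS histories (dBm), oldest reading first.
rss_table = {
    "legit": [-88.0, -75.0, -62.0, -74.0, -89.0],  # weak, strong, weak (bell shape)
    "sybil": [-45.0, -45.0, -46.0, -70.0, -90.0],  # starts strong, then fades
}
suspects = flag_sybil_suspects(rss_table)
```

In a real deployment the threshold would have to be tuned against the assumed maximum node speed, as the second experiment indicates.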

3. Methodology: Proposed SybilSocNet Algorithm

This study employed ML algorithms to develop predictive models capable of identifying Sybil nodes within social networks. Platforms such as Twitter and Facebook are widely popular and allow millions to convene globally. Sybil attacks occur within such networks through the introduction of fake user accounts, with which individuals falsely assume identities. Such fraudulent accounts are employed to influence the public or manipulate network dynamics, for example by propagating specific ideologies through orchestrated campaigns [44]. In one modus operandi, a Sybil attacker who controls multiple Sybil nodes orchestrates fraudulent reports to the social network administrator that falsely implicate legitimate accounts as fraudulent or compromised. Subsequently, the administrator proceeds with automated actions that restrict the legitimate users’ network access or even ban them. Cases of cybercrime have been observed where social media was exploited to sway public opinion during elections and steer votes toward specific political parties. For instance, a hacker in Russia commissioned approximately 25,000 fake accounts, generating about 440,000 tweets that influenced public sentiment after a country’s parliamentary election [45]. To construct a supervised ML model that distinguishes Sybil nodes from legitimate ones in social networks, it is vital to acquire a dataset comprising tweets or posts labeled as Sybil or legitimate. Such a dataset becomes the foundation for employing diverse supervised ML algorithms to train a model to differentiate malicious users, posts, or tweets from authentic ones.

Our research took the work of Binghui Wang et al. and their algorithm SybilSCAR as its base [46] to detect Sybil nodes from the raw dataset of the Twitter follower–followee graph, in which the nodes are labeled as “benign” or “Sybil” [47]. They did not use machine learning at all: they used two structure-based methods, the Random Walk (RW)-based method and the Loopy Belief Propagation (LBP)-based method, applying a (different) local rule to every user. Even though we use the same raw dataset, our approach is quite different, and we still obtained essentially the same results.

Our algorithm begins by cleaning and preprocessing the raw data. Python code was used to reformat the raw data into a matrix that is suitable for input into the machine learning algorithms. We trained multiple machine learning models on our formatted data and evaluated their performance on validation data to identify the most effective model for this research problem. Figure 3 provides an overview of the deployed dataset.

Algorithm 1 presents the pseudocode for splitting the data. This algorithm generates seventy-one smaller data blocks from the dataset. It starts by reading the first line of the dataset and then skips seventy nodes. It captures each new node encountered after the skipping and adds it to a block. This process is repeated until the end of the dataset is reached, which completes the first block. The algorithm then starts over from the second line to create the second block, and continues in this way until all seventy-one blocks are generated.

Algorithm 1: Pseudocode for splitting the data

Algorithm 1 outlines the pseudocode for this initial part of the algorithm. After the seventy-one data blocks are formed, a matrix is created from each block, similar to the one shown in Figure 4.
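Since the body of Algorithm 1 is not reproduced here, the following Python sketch shows one plausible reading of the splitting step: block b takes the b-th node and then every seventy-first node after it. It assumes the nodes are already available as an ordered list of distinct IDs; the function name and the sample IDs are hypothetical.

```python
from typing import List, Sequence

NUM_BLOCKS = 71  # number of blocks stated in the text


def split_into_blocks(nodes: Sequence[int],
                      num_blocks: int = NUM_BLOCKS) -> List[List[int]]:
    """Block b holds nodes[b], nodes[b + num_blocks], nodes[b + 2 * num_blocks], ...

    Taking one node and skipping the next (num_blocks - 1) nodes corresponds
    to the "skip seventy nodes" step in the description.
    """
    return [list(nodes[start::num_blocks]) for start in range(num_blocks)]


# Hypothetical node IDs 1..1000 standing in for the real dataset.
node_ids = list(range(1, 1001))
blocks = split_into_blocks(node_ids)
```

Under this reading, every node lands in exactly one block, and the blocks differ in size by at most one node.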

Next, we use these 71 matrices to train the SVM, KNN, and Random Forest algorithms. The first column, in green, represents the user ID. The remaining columns, in white, indicate the other users who follow the user from the first column. The last column, in yellow, labels users (from the first column) as ‘1’ if they are detected as Sybil or ‘0’ if they are considered legitimate users. This matrix includes all users and their connections.

The pseudocode for the creation of these 71 matrices is portrayed in Algorithm 2.

For example, the first row shows data for user 1, followed by 2, 7, 90, and 111, which are labeled as Sybil users. The second row corresponds to user 5, followed by user 7, which are legitimate users. The matrix continues to include all users in the dataset [47] and their connections. In the second-to-last column (on the right side of the table in Figure 4), an “X” indicates either the node with the highest index or zero if the node is not followed. This is denoted by $X_{i}$, where $X_{i} \in \{0, M\}$, and M is the value of the highest index among the sorted nodes before being included in the matrix (see Algorithm 2).

Algorithm 2: Pseudocode for the Creation of Matrices
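As the body of Algorithm 2 is not reproduced here, the sketch below illustrates only the row layout described for Figure 4: user ID first, follower IDs next, label last. Padding short rows with zeros is an assumption made so that all rows have equal width; the $X_{i}$ column is not modeled, and the function name, edge format, and sample data are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def build_matrix(block: Iterable[int],
                 edges: Iterable[Tuple[int, int]],
                 labels: Dict[int, int]) -> List[List[int]]:
    """Build one row per user: [user, followers..., zero padding, label].

    edges holds (follower, followee) pairs; labels maps a user to
    1 (Sybil) or 0 (legitimate).
    """
    followers: Dict[int, List[int]] = {user: [] for user in block}
    for follower, followee in edges:
        if followee in followers:
            followers[followee].append(follower)
    # Widest follower list decides how much padding the other rows need.
    width = max((len(f) for f in followers.values()), default=0)
    rows = []
    for user, f in followers.items():
        f = sorted(f)
        rows.append([user] + f + [0] * (width - len(f)) + [labels[user]])
    return rows


# The two example rows from the text: user 1 (Sybil) and user 5 (legitimate).
edges = [(2, 1), (7, 1), (90, 1), (111, 1), (7, 5)]
labels = {1: 1, 5: 0}
matrix = build_matrix([1, 5], edges, labels)
```

Each row has the same length, so the result can be passed to a classifier as a feature matrix with the last column as the target.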

3.1. ML Terminologies and Methodologies Applied in This Study

3.1.1. Machine Learning

ML enables computers to learn, somewhat like the human brain, without explicit programming. For instance, training a computer to distinguish between cats and dogs using labeled pictures, or to recognize individual voices, constitutes ML. Many AI-driven companies use ML to emulate human intelligence by learning from the environment [48]. Its applications span sectors from software to the medical and military domains and include image detection, face recognition, and voice assistants like Siri [49] or Alexa [50]. Home security cameras employ ML for movement and face detection, and radiology employs it for diagnosing tumors. This study investigated various ML methods, both supervised and unsupervised, to detail the algorithmic approach and findings.

3.1.2. Supervised Learning

ML algorithms are classified based on their training methods. One type is supervised learning, where algorithms learn to map inputs to outputs from a labeled dataset containing input–output pairs [51], i.e., the algorithm is trained on instances with known labels. The effectiveness of supervised learning arises from having correct target outputs: the algorithm continually refines its output by calculating a value from a loss function, and that value guides adjustments to the model’s weights to achieve the desired output. For instance, when predicting car prices from feature data, e.g., previous prices, model number, brand, mileage, and accident history, an algorithm predicts the car’s value from inputs like model, make, and mileage. Supervised learning has broad applications, such as image recognition, natural language processing (NLP), and speech recognition, and was employed in this study to distinguish Sybil from legitimate Twitter accounts. It comprises two key types: classification and regression. A classification problem concerns categorizing data into specified classes, e.g., classifying animal images into cats or dogs, or segregating individuals by age, either 20–30 or greater than 30. In such tasks, the ML algorithm determines the group to which an instance belongs. Common classification algorithms are decision trees, support vector machines, and random forests, which are trained on labeled data to classify new data; a prerequisite is choosing the right algorithm, which depends on factors like dataset size, training methods, and desired accuracy. Sahami et al. [52] tackle classification via a naïve Bayes classifier that distinguishes spam from legitimate emails, using automated filters to classify inbox emails and enhance the user experience by excluding unwanted content.
Unlike classification, where the ML algorithm predicts a predetermined class, regression predicts a value, e.g., a car’s selling price or a person’s due date. Unlike supervised learning, unsupervised learning omits labeled data for training and instead analyzes unlabeled data to reveal patterns and relationships, making it applicable to clustering and anomaly detection [53]. The K-nearest neighbors (KNN) algorithm, a flexible supervised method, learns from labeled data to predict labels for unlabeled instances. It memorizes the training instances, then uses them to predict the labels of new instances, serving both classification and regression tasks. To label a new instance, KNN compares it with the memorized instances and assigns a label based on similarity [54]. One real-world use of KNN is predicting housing price trends to support investors’ decision making. When KNN is employed as a classifier, the classes are first distinguished, and the model’s accuracy is assessed on labeled instances by splitting the dataset for training and testing using validation techniques, e.g., resubstitution validation, K-fold cross-validation, and repeated K-fold cross-validation, each with its strengths and applications. Next, the number of neighbors (K) used to predict the class is chosen, which influences accuracy; the optimal K balances pattern detection and sensitivity. Then, the distances between the instance and its neighbors are computed, and the classes of the closest K neighbors decide the predicted class using a distance metric (e.g., Euclidean, Manhattan, Minkowski). Refer to Formula (1) for these methods [55]. Another distance-measuring method is cosine similarity, depicted in Formula (2), commonly used in text retrieval, as discussed by Lu et al. [15].
Additionally, KNN can employ other distance formulas, such as the Minkowski distance introduced by Batchelor [18] (Formula (3)), and the chi-square formula (Formula (4)) and the correlation equation (Formula (5)) by Michalski and Stepp [56], which constitute distinct distance metrics for KNN’s neighbor distance calculation.

\mathrm{dist}(A, B) = \sqrt{\frac{\sum_{i = 1}^{m} (x_{i} - y_{i})^{2}}{m}}

(1)

\mathrm{sim}(A, B) = \frac{\vec{A} \cdot \vec{B}}{|\vec{A}| \, |\vec{B}|}

(2)

\mathrm{dist}_{\mathrm{Minkowski}}(A, B) = \left( \sum_{i = 1}^{m} |x_{i} - y_{i}|^{r} \right)^{1 / r}

(3)

\mathrm{dist}_{\mathrm{Chi\text{-}square}}(A, B) = \sum_{i = 1}^{m} \frac{1}{\mathrm{sum}_{i}} \left( \frac{x_{i}}{\mathrm{size}_{Q}} - \frac{y_{i}}{\mathrm{size}_{I}} \right)^{2}

(4)

\mathrm{dist}_{\mathrm{correlation}}(A, B) = \frac{\sum_{i = 1}^{m} (x_{i} - \mu_{i}) (y_{i} - \mu_{i})}{\sqrt{\sum_{i = 1}^{m} (x_{i} - \mu_{i})^{2} \sum_{i = 1}^{m} (y_{i} - \mu_{i})^{2}}}

(5)

Each equation calculates the distance between the instance under study and the other instances in the dataset, leading to varying outputs and accuracy measurements. These aspects have been explored in several research projects. Medjahed et al. [57] investigated breast cancer diagnosis using KNN and its dependency on diverse distance metrics. Iswanto et al. [31] conducted comparable studies on the impact of these distance equations in stroke disease detection. After calculating the distances between the studied instance and all other dataset instances, the algorithm picks the K nearest neighbors, where K determines the number of neighbors. For instance, with K set to 100, it chooses the 100 neighbors with the smallest distances to the instance. However, this step can be computationally intensive, especially for large datasets, as it requires many distance calculations and comparison operations. Once the K nearest neighbors are identified, the algorithm predicts the instance’s class based on their classes, a step called “Voting”, which offers Weighted and Majority options. Weighted Voting assigns varying weights to classes, while Majority Voting treats all classes equally; this adaptability accommodates developers aiming to emphasize specific class attributes.
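The distance and voting steps described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the implementation used in this study. The function names, the toy data, and the choice to weight votes by inverse distance (one common form of weighted voting) are assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def euclidean_normalized(a, b):
    """Formula (1): square root of the mean squared coordinate difference."""
    m = len(a)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / m)


def cosine_similarity(a, b):
    """Formula (2): dot product divided by the product of the vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def minkowski(a, b, r=2):
    """Formula (3): r = 1 gives Manhattan distance, r = 2 gives Euclidean."""
    return sum(abs(x - y) ** r for x, y in zip(a, b)) ** (1 / r)


def knn_predict(train, query, k=3, dist=minkowski, weighted=False):
    """Predict the label of `query` from (features, label) pairs in `train`."""
    # Distance from the query to every training instance (the costly step),
    # then keep the k instances with the smallest distances.
    neighbors = sorted(
        ((dist(x, query), label) for x, label in train),
        key=lambda pair: pair[0],
    )[:k]
    if not weighted:
        # Majority voting: every neighbor counts once.
        return Counter(label for _, label in neighbors).most_common(1)[0][0]
    # Weighted voting (assumed variant): closer neighbors count more.
    scores = defaultdict(float)
    for d, label in neighbors:
        scores[label] += 1.0 / (d + 1e-9)
    return max(scores, key=scores.get)


# Toy data (hypothetical): label 0 = legitimate, label 1 = Sybil.
train = [
    ((1.0, 1.0), 0), ((1.2, 0.8), 0), ((0.9, 1.1), 0),
    ((5.0, 5.0), 1), ((5.2, 4.9), 1),
]
print(knn_predict(train, (1.1, 1.0), k=3))
print(knn_predict(train, (5.1, 5.0), k=3, weighted=True))
```

Swapping the `dist` argument is enough to compare metrics, which is the kind of dependency on the distance formula that the studies cited above examine.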

3.1.3. Support Vector Machine Algorithm (SVM)

SVM, a supervised learning algorithm, operates on labeled datasets, aiming to learn a hyperplane that separates data points into distinct classes [58]. Its training optimizes the hyperplane’s position for effective classification of new points [59]. SVM assigns a new point to a class by assessing its position relative to the hyperplane. In Figure 5, the hyperplane visually separates red dots from green stars, with dashed lines indicating the distance between the hyperplane and the nearest support vectors [27,60]. The positions of the support vectors during training determine the hyperplane’s position and margin size [40]. In the testing phase, the hyperplane functions as the class boundary [61], classifying new instances. SVM’s hyperplane is crucial for classification and regression tasks, predicting the class or value of a new instance from its position with respect to the hyperplane [62,63].

Based on its position relative to the hyperplane, the instance receives a class or a value according to the problem type. For instance, in a classification problem, the trained SVM algorithm predicts the class of a new instance: after training and hyperplane positioning, the algorithm assigns the instance to a class. If the brown rhombus falls on the hyperplane’s left side, associated with green stars, the algorithm predicts it as a duck (assuming green stars are ducks and red dots are lions). Similarly, the purple triangle and pink hexagon on the hyperplane’s right side belong to the lion class, so the algorithm predicts them as lions. Even the pink hexagon, which lies on the right margin, is classified as a lion because it is on the right side. Margins are used only during training, where they define the optimization problem that positions the hyperplane; they play no role in classifying new instances during testing or deployment. Hyperplanes take diverse shapes depending on the dataset and feature count. With one feature, they are points, e.g., grading students as pass/fail. Two features yield lines or shapes, three form planes, and M features create a hyperplane in M-dimensional space. Formulas (1)–(5) and Figure 5 depict such variations, with blue and red circles representing the classes and the hyperplane drawn as a brown line or circle, as shown in Figure 6. Figure 6 also depicts a three-dimensional case, distinct from the line and circle, for a dataset with three features on the X, Y, and Z axes. With N features, the hyperplane lives in N dimensions, a concept easier to grasp by moving from one to two and finally to three dimensions.
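The point that only the side of the hyperplane matters at test time can be illustrated with a small Python sketch. The weights, the bias, and the coordinates of the three shapes below are hypothetical values chosen to mirror the description of Figure 5; they are not taken from the study.

```python
def svm_decide(w, b, x):
    """Classify x by the side of the hyperplane w.x + b = 0 on which it falls."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    # The margins (score = -1 and score = +1) are ignored here: only the
    # sign of the score decides the class at test time.
    label = "lion" if score >= 0 else "duck"
    return label, score


# Hypothetical trained hyperplane: the vertical line x1 = 3 in two dimensions,
# with margins at x1 = 2 and x1 = 4.
w, b = (1.0, 0.0), -3.0

brown_rhombus = (1.0, 2.0)    # left of the hyperplane
purple_triangle = (6.0, 1.0)  # right of the hyperplane
pink_hexagon = (4.0, 2.5)     # exactly on the right margin

for name, point in [("rhombus", brown_rhombus),
                    ("triangle", purple_triangle),
                    ("hexagon", pink_hexagon)]:
    print(name, svm_decide(w, b, point))
```

For a regression-style output, the returned score (a signed, scaled distance from the hyperplane) could be used instead of the label.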

Implementing an SVM involves five steps:

Data Preparation: As SVM is supervised, labeled data are used for training, and every instance must be associated with a specific class. Data can be collected in various ways, such as downloading datasets from platforms like Kaggle [64] or customizing existing ones.
Feature Scaling: SVM weights features according to their values, so features with larger magnitudes can dominate. Scaling ensures that all features are treated equitably and prevents this bias. Scaling methods like standardization and normalization are often employed.
Kernel Selection: SVM utilizes the kernel trick, mapping data to higher dimensions for easier classification. Kernels are of various types, suited to different data and applications. The kernel choice also impacts computational efficiency.
Training and Optimization: After the kernel is selected, training determines the hyperplane position. This requires solving an optimization problem that minimizes the cost function while considering margin and regularization factors. Different algorithms, e.g., SMO and quadratic programming, can be used.
Testing: SVM can handle classification and regression. In classification, the algorithm’s predictions are compared to known instances. For regression, SVM predicts values based on the point’s distance from the hyperplane.
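The feature scaling and kernel selection steps above can be sketched as follows. This is a minimal standard-library illustration rather than the code used in this study. The function names and the `gamma` default are assumptions, and the RBF kernel is shown only as a well-known alternative to the linear kernel.

```python
import math


def standardize(column):
    """Standardization: rescale a feature to zero mean and unit variance."""
    mean = sum(column) / len(column)
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / len(column))
    return [(v - mean) / std if std > 0 else 0.0 for v in column]


def min_max(column):
    """Normalization: rescale a feature to the interval [0, 1]."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in column]


def linear_kernel(a, b):
    """Linear kernel: the plain dot product of two feature vectors."""
    return sum(x * y for x, y in zip(a, b))


def rbf_kernel(a, b, gamma=0.5):
    """RBF kernel: similarity decays with the squared Euclidean distance."""
    squared = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * squared)


print(min_max([10, 20, 30]))
print(standardize([1.0, 2.0, 3.0]))
print(linear_kernel((1, 2), (3, 4)), rbf_kernel((1, 2), (1, 2)))
```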

3.1.4. Random Forest Algorithm

Random forest, an ensemble supervised method, combines small models to form a larger model for problem solving. The algorithm operates in stages: small models tackle subproblems, and their solutions contribute to the final prediction through a voting system. The large model is a collection of trees, with each tree handling a subset of features. The following five steps characterize the random forest algorithm:

Dataset preparation: Like the prior algorithms, it requires labeled data for training, with a dataset consisting of features linked to corresponding classes, e.g., car quality correlates with attributes like make, model, and price.
Decision tree: The forest is built from smaller decision trees, each predicting outcomes from a distinct portion of the dataset, without all features being related to the target outputs. This forms a diverse set of trees, akin to people solving problems in their own ways. Ensemble voting aggregates their solutions to determine the final prediction.
Trees’ feature selection: Different randomly chosen features are assigned to each tree to curb overfitting.
Bootstrapping: This involves sampling the data to create a new dataset for each tree, generating variations that avoid uniformity. Bootstrapping introduces randomness, yielding a diverse array of perspectives on the problem, which is crucial to preventing overfitting (Figure 7).
Output Prediction: The algorithm predicts the output, either a class for classification or a value for regression. All decision trees contribute predictions, and a voting mechanism selects the majority class for classification or the average output for regression.

The random forest algorithm offers adjustable parameters, including tree count, bootstrap dataset percentages, and decision tree node count, catering to the application’s complexity [65].
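The randomization and aggregation steps listed above (feature selection, bootstrapping, and output prediction) can be sketched without building actual trees. This is a minimal illustration with hypothetical helper names and a fixed seed, not the configuration used in this study.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, so some rows repeat."""
    return [rng.choice(rows) for _ in rows]


def random_feature_subset(n_features, k, rng):
    """Pick k distinct feature indices for one tree."""
    return sorted(rng.sample(range(n_features), k))


def majority_vote(predictions):
    """Classification: the class predicted by most trees wins."""
    return Counter(predictions).most_common(1)[0][0]


def average(predictions):
    """Regression: the forest output is the mean of the tree outputs."""
    return sum(predictions) / len(predictions)


rng = random.Random(42)  # fixed seed so the sketch is repeatable
rows = [("u1", 1), ("u2", 0), ("u3", 0), ("u4", 1), ("u5", 0)]

# Each tree would be trained on its own bootstrap sample and feature subset.
for tree in range(3):
    sample = bootstrap_sample(rows, rng)
    features = random_feature_subset(n_features=6, k=2, rng=rng)
    print(tree, features, sample)

# Aggregating hypothetical per-tree predictions for one instance.
print(majority_vote([1, 0, 1, 1, 0]))
print(average([2.0, 4.0]))
```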

4. Data Analysis

4.1. Steps, Results, and Validation

This section reviews the steps and results of this empirical study’s experiments using the dataset by Jethava and Rao [66], previously utilized by Patel and Mistry [16]. First, the data were cleansed and reformatted so they could be used to train different ML algorithms to detect Sybil nodes. The Twitter user dataset consists of followers and followees, i.e., it indicates which accounts each account follows. Figure 3 shows part of the Jethava and Rao [66] dataset, indicating that user ID 1 is following 4, 5, 7, etc., up to the end of the dataset.

Figure 8 depicts user 5’s connections 1, 4, 7, etc. A separate dataset segment consists of a two-row file containing Sybil and legitimate node IDs. This study carefully processed and restructured the dataset for input into its ML algorithms, accompanied by the matrix generation process using pseudocode, portrayed earlier in this article. This entailed crafting a comprehensive matrix where rows represented user IDs, followed by followers’ IDs in subsequent columns, and the last column indicated user legitimacy (0 for non-Sybil, 1 for Sybil), derived from ‘269640_test.txt’ dataset row 2 (Jethava and Rao [66]).
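The matrix layout described above (user ID, follower IDs, and a final 0/1 label) can be sketched as follows. This is a hypothetical illustration and not a reproduction of Algorithm 2. The function name `build_matrix`, the fixed `width`, the truncation of long follower lists, and the zero padding for missing followers are all assumptions.

```python
def build_matrix(followers, sybil_ids, width):
    """Return rows of [user_id, follower_1, ..., follower_width, label]."""
    rows = []
    for user_id in sorted(followers):
        # Sort follower IDs and cut them to the assumed fixed width.
        ids = sorted(followers[user_id])[:width]
        # Assumption: 0 marks an empty follower slot.
        padded = ids + [0] * (width - len(ids))
        # Last column: 1 for Sybil, 0 for legitimate.
        label = 1 if user_id in sybil_ids else 0
        rows.append([user_id] + padded + [label])
    return rows


# Toy input mirroring the example rows discussed earlier in the article:
# user 1 is followed by 2, 7, 90, and 111 (Sybil); user 5 by 7 (legitimate).
followers = {1: [2, 7, 90, 111], 5: [7]}
sybil_ids = {1}

matrix = build_matrix(followers, sybil_ids, width=4)
for row in matrix:
    print(row)
```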

This study first generated a single matrix for training the ML algorithms, followed by experiments to optimize accuracy. Due to the dataset’s extensive size, processing it required significant computational resources, e.g., a supercomputer. Lacking access to one, we split the data into seventy-one samples and generated corresponding smaller matrices for training and testing. This approach eased the computational demands, which was crucial given the initial matrix’s massive 117 GB size. Each of the seventy-one matrices was approximately 299.62 MB, totaling 20.77 GB. Despite this size reduction, the original data remained intact and were organized into independent matrices, enhancing manageability and computational feasibility. Next, the ML algorithms, i.e., KNN, SVM, and random forest, were trained on the seventy-one matrices to differentiate Sybil nodes from legitimate nodes. The next section reports the accuracy outcomes of each algorithm. During training, the SVM failed to finish on four matrices even after extensive training times, though KNN and random forest trained successfully. Hence, this study analyzed the algorithms’ results on sixty-seven matrices, omitting the four problematic ones. All ML algorithms used the aforementioned seventy-one matrices as input for training. Segmented data blocks served for training and testing, yielding high accuracy. The algorithms produced a uniform output format, represented by a 2 × 2 matrix (Figure 9), enabling direct comparison of their outputs.

In Figure 9, a common 2 × 2 output matrix is observed from the ML models and can be interpreted as follows:

[\begin{matrix} I_{00} & I_{01} \\ I_{10} & I_{11} \end{matrix}]

Element I₀₀: True positives for non-Sybil instances, a value representing legitimate nodes correctly detected by the algorithm as such.
Element I₀₁: False positives for Sybil instances, signifying nodes falsely labeled as Sybil by the algorithm. Ideally, this value is minimal or zero, preventing the mislabeling of legitimate nodes as Sybil.
Element I₁₀: Sybil nodes undetected by the algorithm. The general aim is to keep this count to a minimum to ensure Sybil detection.
Element I₁₁: Sybil instances’ true positives, indicating correctly identified Sybil nodes.

Consider the SVM algorithm’s output matrix (Figure 9), where I₀₀ = 719 represents the correctly detected legitimate nodes; I₀₁ = 0 means no legitimate nodes were falsely labeled as Sybil, which is desirable; I₁₀ = 2 indicates that two Sybil nodes went undetected and were labeled as legitimate; and I₁₁ = 336 is the number of correctly detected Sybil nodes. The algorithm’s accuracy in this case was 99.81%, calculated by adding I₀₀ and I₁₁ and dividing by the sum of all the elements in the matrix. The accuracy was computed for every algorithm after processing each matrix. This study encountered issues obtaining SVM outputs for some of the matrices; this is not uncommon [2,11,31,57]. We employed a linear kernel for the training phase; other kernel types might yield fewer or no issues with SVM computation.
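The accuracy calculation for this example can be checked with a short Python sketch. The function name `accuracy` is chosen here for illustration; the four counts are the ones quoted above.

```python
def accuracy(confusion):
    """Accuracy of a 2 x 2 output matrix [[I00, I01], [I10, I11]]."""
    (i00, i01), (i10, i11) = confusion
    # Correct decisions sit on the diagonal; divide by all decisions.
    return (i00 + i11) / (i00 + i01 + i10 + i11)


svm_example = [[719, 0], [2, 336]]
print(f"{accuracy(svm_example):.2%}")  # 99.81%
```

Here 719 + 336 = 1055 correct decisions out of 1057 nodes gives 1055/1057 ≈ 0.9981.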

4.2. Data Overview

This study compared the performance of various ML algorithms in terms of accuracy. Table 1 summarizes the results, including the mean value and standard deviation for each algorithm we examined. Table 1 reveals that the SVM algorithm achieved the highest accuracy, trailed by the random forest algorithm. KNN ranked last in accuracy. In terms of standard deviation, the random forest algorithm demonstrated the lowest value, with KNN following in second place, and the SVM algorithm exhibiting the highest standard deviation. The data presented are the output accuracies for the K-nearest neighbor, random forest, and SVM algorithms. The tables for these algorithms’ data are portrayed in Appendix A, Appendix B, and Appendix C, respectively.
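Summary statistics of this kind can be computed along the following lines. This is a minimal sketch using Python’s standard `statistics` module. The four accuracies are those of KNN matrices 0, 1, 3, and 4 in Appendix A and serve only as an illustration, so the output does not reproduce Table 1. Whether the study used the sample or the population standard deviation is not stated; the sample version is an assumption here.

```python
import statistics


def summarize(accuracies):
    """Return (mean, sample standard deviation) of a list of accuracies."""
    return statistics.mean(accuracies), statistics.stdev(accuracies)


# KNN accuracies for matrices 0, 1, 3, and 4 (Appendix A), illustration only.
knn_subset = [
    0.9640491958372753,
    0.9659413434247871,
    0.9725638599810785,
    0.9744560075685903,
]

mean, std = summarize(knn_subset)
print(f"mean = {mean:.4f}, std = {std:.4f}")
```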

Figure 10 and Figure 11 depict the KNN accuracy results for the different matrices. As depicted, the accuracy had a maximum of approximately 97% and a minimum of around 96%.

Figure 12 and Figure 13 depict the accuracy for the random forest algorithm and the SVM algorithm, respectively.

Figure 14, Figure 15, Figure 16 and Figure 17 depict the different performance outcomes of all the examined algorithms.

Furthermore, the KNN algorithm had high accuracy levels; however, these were lower than those of the support vector machine and random forest algorithms. The chosen constant K was investigated alongside the dataset’s size. As depicted in Figure 18, KNN has a normal distribution with accuracies ranging between 96% and 97.5%, with 96.6% having the highest frequency.

While training the three ML algorithms, the SVM algorithm needed a relatively long time compared with the random forest and KNN algorithms. Notably, the SVM algorithm took too long to train on the matrices [11,13,31,57]; after several days without the computation finishing, we decided to skip these matrices. The average time the SVM took to train on the other matrices was about 2 min. The normal distribution graph for the support vector machine algorithm shown in Figure 19 indicates that the algorithm has a high probability of producing an accuracy between 99.5% and 100%, with a relatively low probability of an accuracy between 97% and 99%. The random forest algorithm has a high probability of producing an accuracy between 99.1% and 99.4%, as shown in Figure 20. Random forest had high accuracy rates and the lowest standard deviation among the machine learning algorithms. Also, the time required to train the random forest algorithm was relatively short compared to the time required to train the support vector machine algorithm. The random forest algorithm had a normal distribution with accuracies ranging between 98.4% and 99.6%, with a high probability of accuracy between 99.0% and 99.5%. Mounica et al. [45] achieved an accuracy of 93%; their approach was only for wireless networks. In contrast, our algorithm works for wired and wireless networks with higher accuracy. Lu et al. [15] achieved an AUC value of 0.93 with their proposed Sybil detection method. Patel and Mistry [16] had an AUC value of 0.82 for a large Twitter dataset. Our accuracy results ranged from 96% (KNN) to 99% (SVM).

5. Conclusions, Limitations, and Future Recommendations

5.1. Conclusions

This study began with a literature review on Sybil attacks and their targeting of different types of networks. This review delved into the protection and prevention algorithms employed to secure networks vulnerable to Sybil attacks. It also included an overview of reputation systems and an outline of our research methodologies. This study concluded by detailing the steps taken, analyzing results, explaining research limitations, and proposing future research directions. In this study, we developed an algorithm to detect Sybil nodes in an online social network, specifically Twitter, using the number of direct connections as a criterion. This algorithm can function as part of a reputation system within online social networks to identify Sybil nodes. The algorithms yielded high detection accuracy rates, ranging between 96% and 97% for the KNN algorithm, 97% and 100% for the SVM algorithm, and 98.4% and 99.6% for the random forest algorithm. Notably, these algorithms exhibited reliability with minimal standard deviations, all under 0.0061%, with the random forest algorithm showing the lowest deviation of 0.0029%. These outcomes are attainable using standard workstation or desktop computers.

The algorithm developed in this study depends solely on connection data and their directions. In contrast, the approach by Patel and Mistry [16] necessitates more parameters to ascertain whether a node is a Sybil or not. Our accuracy results are notably high. Furthermore, the scalability of this algorithm is noteworthy. We efficiently divided large datasets into smaller matrices, enhancing computation speed and enabling parallel execution. This matrix division also significantly reduces the total size of the matrices used for training, contributing to improved performance and stability.

5.2. Limitations

One notable limitation of this study is the inability to train the SVM algorithm on the aforementioned four matrices. Additionally, lacking access to a supercomputer, the authors were constrained in conducting broader testing.

5.3. Future Research Opportunities

Several avenues offer opportunities for future investigation. First, testing the study’s algorithms on a supercomputer using the complete, undivided matrix is of interest. Such an exploration would assess how the results are influenced when all data are consolidated into a single, large matrix. Second, future research could explore extending or reconfiguring the data to make them applicable to other ML algorithms. Third, examining the application of this algorithm to various online social networks (OSNs) and wireless sensor networks (WSNs) could yield valuable insights. Finally, reducing the number of variables to shrink the matrix size and minimize the algorithm’s memory footprint is a worthwhile pursuit.

Author Contributions

Conceptualization, J.A.C.-H. and A.N.A.-G.; methodology, J.A.C.-H. and A.N.A.-G.; software, J.A.C.-H. and A.N.A.-G.; validation, J.A.C.-H., M.S. and A.N.A.-G.; formal analysis, J.A.C.-H., A.N.A.-G. and M.S.; investigation, J.A.C.-H. and A.N.A.-G.; resources, J.A.C.-H., A.N.A.-G. and M.S.; data curation, J.A.C.-H. and A.N.A.-G.; writing—original draft preparation, J.A.C.-H. and M.S.; writing—review and editing, J.A.C.-H., M.S., A.N.A.-G., R.L.-A. and M.D.; visualization, J.A.C.-H., M.S. and A.N.A.-G.; supervision, J.A.C.-H. and A.N.A.-G.; project administration, J.A.C.-H. All authors have read and agreed to the published version of this manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The code and data utilized in this research are available upon reasonable request. Please contact the corresponding author for access to the resources mentioned with a brief description of your intended use. We aim to respond to requests within a reasonable timeframe and may require additional information to ensure responsible data-sharing practices. We are committed to promoting transparency and reproducibility in our research practices, and, therefore, we encourage interested parties to reach out for access to the materials supporting this study. We believe in promoting open research and data sharing to facilitate reproducibility and advancement in the field.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

The following is the output from the machine learning algorithms. Each line consists of the file’s name (i.e., the matrix number), followed by the type of the machine learning algorithm being trained and tested. The algorithm’s name is followed by the four numbers inside each matrix of results and then by the accuracy value for each algorithm. The following is a detailed example explaining the first line in the data:

numbersArranged0.txt,Nearest Neighbors,699,0,38,320,0.9640491958372753

numbersArranged0.txt: The file name of the first matrix (matrix number zero).
Nearest Neighbors: The type of machine learning algorithm being used, in this case, KNN.
699: I₀₀, the number of legitimate nodes correctly identified as legitimate (see Section 4.1).
0: I₀₁, the number of false positives, i.e., legitimate nodes labeled as Sybil.
38: I₁₀, the number of false negatives, i.e., Sybil nodes labeled as legitimate.
320: I₁₁, the number of Sybil nodes correctly identified as Sybil.
0.9640491958372753: The accuracy of KNN on the first matrix.
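A line in this format can be parsed and its accuracy rechecked with a short Python sketch. The function names are hypothetical, and the field order follows the description above.

```python
def parse_result_line(line):
    """Split one output line into file name, algorithm, counts, and accuracy."""
    name, algorithm, a, b, c, d, acc = line.strip().split(",")
    return {
        "file": name,
        "algorithm": algorithm,
        "counts": [int(a), int(b), int(c), int(d)],
        "accuracy": float(acc),
    }


def recompute_accuracy(counts):
    """First and fourth counts are correct decisions; divide by the total."""
    return (counts[0] + counts[3]) / sum(counts)


record = parse_result_line(
    "numbersArranged0.txt,Nearest Neighbors,699,0,38,320,0.9640491958372753"
)
print(record["file"], record["algorithm"], recompute_accuracy(record["counts"]))
```

Recomputing the accuracy from the four counts, here (699 + 320)/1057, is a quick way to spot transcription errors in the listed values.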

numbersArranged0.txt,Nearest Neighbors,699,0,38,320,0.9640491958372753
numbersArranged1.txt,Nearest Neighbors,699,0,36,322,0.9659413434247871
numbersArranged2.txt,Nearest Neighbors,699,0,36,322,0.9659413434247871
numbersArranged3.txt,Nearest Neighbors,699,0,29,329,0.9725638599810785
numbersArranged4.txt,Nearest Neighbors,699,0,27,331,0.9744560075685903
numbersArranged5.txt,Nearest Neighbors,699,0,38,320,0.9640491958372753
numbersArranged6.txt,Nearest Neighbors,700,0,35,322,0.9668874172185431
numbersArranged7.txt,Nearest Neighbors,700,0,38,319,0.9640491958372753
numbersArranged8.txt,Nearest Neighbors,698,0,41,318,0.9612109744560076
numbersArranged9.txt,Nearest Neighbors,698,0,30,329,0.9716177861873226
numbersArranged10.txt,Nearest Neighbors,700,0,35,322,0.9668874172185431
numbersArranged11.txt,Nearest Neighbors,699,0,36,322,0.9659413434247871
numbersArranged12.txt,Nearest Neighbors,698,0,36,323,0.9659413434247871
numbersArranged13.txt,Nearest Neighbors,700,0,28,329,0.9735099337748344
numbersArranged14.txt,Nearest Neighbors,700,0,28,329,0.9735099337748344
numbersArranged15.txt,Nearest Neighbors,699,0,37,321,0.9649952696310312
numbersArranged16.txt,Nearest Neighbors,696,0,30,331,0.9716177861873226
numbersArranged17.txt,Nearest Neighbors,700,0,37,320,0.9649952696310312
numbersArranged18.txt,Nearest Neighbors,697,0,36,324,0.9659413434247871
numbersArranged19.txt,Nearest Neighbors,699,0,36,322,0.9659413434247871
numbersArranged20.txt,Nearest Neighbors,695,0,31,331,0.9706717123935666
numbersArranged21.txt,Nearest Neighbors,700,0,37,320,0.9649952696310312
numbersArranged22.txt,Nearest Neighbors,700,0,30,327,0.9716177861873226
numbersArranged23.txt,Nearest Neighbors,698,0,41,318,0.9612109744560076
numbersArranged24.txt,Nearest Neighbors,699,0,29,329,0.9725638599810785
numbersArranged25.txt,Nearest Neighbors,700,0,28,329,0.9735099337748344
numbersArranged26.txt,Nearest Neighbors,699,0,27,331,0.9744560075685903
numbersArranged27.txt,Nearest Neighbors,699,0,32,326,0.9697256385998108
numbersArranged28.txt,Nearest Neighbors,699,0,34,324,0.967833491012299
numbersArranged29.txt,Nearest Neighbors,698,0,31,328,0.9706717123935666
numbersArranged30.txt,Nearest Neighbors,697,0,39,321,0.9631031220435194
numbersArranged31.txt,Nearest Neighbors,697,0,40,320,0.9621570482497634
numbersArranged32.txt,Nearest Neighbors,699,0,27,331,0.9744560075685903
numbersArranged33.txt,Nearest Neighbors,699,0,27,331,0.9744560075685903
numbersArranged34.txt,Nearest Neighbors,697,0,31,329,0.9706717123935666
numbersArranged35.txt,Nearest Neighbors,699,0,38,320,0.9640491958372753
numbersArranged36.txt,Nearest Neighbors,700,0,33,324,0.9687795648060549
numbersArranged37.txt,Nearest Neighbors,700,0,29,328,0.9725638599810785
numbersArranged38.txt,Nearest Neighbors,695,0,40,322,0.9621570482497634
numbersArranged40.txt,Nearest Neighbors,699,0,35,323,0.9668874172185431
numbersArranged41.txt,Nearest Neighbors,699,0,36,322,0.9659413434247871
numbersArranged42.txt,Nearest Neighbors,699,0,36,322,0.9659413434247871
numbersArranged43.txt,Nearest Neighbors,700,0,34,323,0.967833491012299
numbersArranged44.txt,Nearest Neighbors,696,0,41,320,0.9612109744560076
numbersArranged45.txt,Nearest Neighbors,697,0,40,320,0.9621570482497634
numbersArranged46.txt,Nearest Neighbors,720,0,37,300,0.9649952696310312
numbersArranged47.txt,Nearest Neighbors,719,0,36,302,0.9659413434247871
numbersArranged48.txt,Nearest Neighbors,720,0,36,301,0.9659413434247871
numbersArranged49.txt,Nearest Neighbors,720,0,37,300,0.9649952696310312
numbersArranged50.txt,Nearest Neighbors,720,0,29,308,0.9725638599810785
numbersArranged51.txt,Nearest Neighbors,719,0,33,305,0.9687795648060549
numbersArranged52.txt,Nearest Neighbors,719,0,40,298,0.9621570482497634
numbersArranged53.txt,Nearest Neighbors,718,0,35,304,0.9668874172185431
numbersArranged56.txt,Nearest Neighbors,719,0,30,308,0.9716177861873226
numbersArranged58.txt,Nearest Neighbors,719,0,42,296,0.9602649006622517
numbersArranged59.txt,Nearest Neighbors,720,0,36,301,0.9659413434247871
numbersArranged60.txt,Nearest Neighbors,718,0,34,305,0.967833491012299
numbersArranged61.txt,Nearest Neighbors,719,0,31,307,0.9706717123935666
numbersArranged62.txt,Nearest Neighbors,720,0,34,303,0.967833491012299
numbersArranged63.txt,Nearest Neighbors,715,0,31,311,0.9706717123935666
numbersArranged64.txt,Nearest Neighbors,719,0,33,305,0.9687795648060549
numbersArranged65.txt,Nearest Neighbors,719,0,35,303,0.9668874172185431
numbersArranged66.txt,Nearest Neighbors,719,0,29,309,0.9725638599810785
numbersArranged67.txt,Nearest Neighbors,721,0,36,300,0.9659413434247871
numbersArranged68.txt,Nearest Neighbors,716,0,39,302,0.9631031220435194
numbersArranged69.txt,Nearest Neighbors,720,0,35,302,0.9668874172185431
numbersArranged70.txt,Nearest Neighbors,721,0,34,302,0.967833491012299

Appendix B

numbersArranged0.txt,Random Forest,692,7,2,356,0.9914853358561968
numbersArranged1.txt,Random Forest,694,5,3,355,0.9924314096499527
numbersArranged2.txt,Random Forest,697,2,5,353,0.9933774834437086
numbersArranged3.txt,Random Forest,695,4,4,354,0.9924314096499527
numbersArranged4.txt,Random Forest,693,6,7,351,0.9877010406811731
numbersArranged5.txt,Random Forest,691,8,2,356,0.9905392620624409
numbersArranged6.txt,Random Forest,693,7,1,356,0.9924314096499527
numbersArranged7.txt,Random Forest,695,5,2,355,0.9933774834437086
numbersArranged8.txt,Random Forest,691,7,6,353,0.9877010406811731
numbersArranged9.txt,Random Forest,696,2,8,351,0.9905392620624409
numbersArranged10.txt,Random Forest,697,3,1,356,0.9962157048249763
numbersArranged11.txt,Random Forest,695,4,2,356,0.9943235572374646
numbersArranged12.txt,Random Forest,695,3,8,351,0.9895931882686849
numbersArranged13.txt,Random Forest,695,5,4,353,0.9914853358561968
numbersArranged14.txt,Random Forest,696,4,2,355,0.9943235572374646
numbersArranged15.txt,Random Forest,693,6,3,355,0.9914853358561968
numbersArranged16.txt,Random Forest,691,5,8,353,0.9877010406811731
numbersArranged17.txt,Random Forest,699,1,1,356,0.9981078524124882
numbersArranged18.txt,Random Forest,695,2,6,354,0.9924314096499527
numbersArranged19.txt,Random Forest,698,1,5,353,0.9943235572374646
numbersArranged20.txt,Random Forest,692,3,8,354,0.9895931882686849
numbersArranged21.txt,Random Forest,697,3,1,356,0.9962157048249763
numbersArranged22.txt,Random Forest,697,3,3,354,0.9943235572374646
numbersArranged23.txt,Random Forest,698,0,5,354,0.9952696310312205
numbersArranged24.txt,Random Forest,695,4,2,356,0.9943235572374646
numbersArranged25.txt,Random Forest,697,3,5,352,0.9924314096499527
numbersArranged26.txt,Random Forest,698,1,7,351,0.9924314096499527
numbersArranged27.txt,Random Forest,697,2,4,354,0.9943235572374646
numbersArranged28.txt,Random Forest,696,3,5,353,0.9924314096499527
numbersArranged29.txt,Random Forest,696,2,5,354,0.9933774834437086
numbersArranged30.txt,Random Forest,693,4,5,355,0.9914853358561968
numbersArranged31.txt,Random Forest,691,6,8,352,0.9867549668874173
numbersArranged32.txt,Random Forest,695,4,4,354,0.9924314096499527
numbersArranged33.txt,Random Forest,695,4,3,355,0.9933774834437086
numbersArranged34.txt,Random Forest,695,2,4,356,0.9943235572374646
numbersArranged35.txt,Random Forest,696,3,4,354,0.9933774834437086
numbersArranged36.txt,Random Forest,698,2,3,354,0.9952696310312205
numbersArranged37.txt,Random Forest,698,2,2,355,0.9962157048249763
numbersArranged38.txt,Random Forest,690,5,7,355,0.988647114474929
numbersArranged40.txt,Random Forest,692,7,4,354,0.9895931882686849
numbersArranged41.txt,Random Forest,697,2,2,356,0.9962157048249763
numbersArranged42.txt,Random Forest,696,3,3,355,0.9943235572374646
numbersArranged43.txt,Random Forest,694,6,5,352,0.9895931882686849
numbersArranged44.txt,Random Forest,694,2,5,356,0.9933774834437086
numbersArranged45.txt,Random Forest,692,5,6,354,0.9895931882686849
numbersArranged46.txt,Random Forest,713,7,8,329,0.9858088930936613
numbersArranged47.txt,Random Forest,715,4,6,332,0.9905392620624409
numbersArranged48.txt,Random Forest,713,7,3,334,0.9905392620624409
numbersArranged49.txt,Random Forest,719,1,4,333,0.9952696310312205
numbersArranged50.txt,Random Forest,719,1,4,333,0.9952696310312205
numbersArranged51.txt,Random Forest,716,3,4,334,0.9933774834437086
numbersArranged52.txt,Random Forest,711,8,9,329,0.9839167455061495
numbersArranged53.txt,Random Forest,715,3,6,333,0.9914853358561968
numbersArranged56.txt,Random Forest,716,3,5,333,0.9924314096499527
numbersArranged58.txt,Random Forest,713,6,3,335,0.9914853358561968
numbersArranged59.txt,Random Forest,717,3,4,333,0.9933774834437086
numbersArranged60.txt,Random Forest,714,4,5,334,0.9914853358561968
numbersArranged61.txt,Random Forest,715,4,2,336,0.9943235572374646
numbersArranged62.txt,Random Forest,717,3,8,329,0.9895931882686849
numbersArranged63.txt,Random Forest,708,7,8,334,0.9858088930936613
numbersArranged64.txt,Random Forest,719,0,2,336,0.9981078524124882
numbersArranged65.txt,Random Forest,717,2,7,331,0.9914853358561968
numbersArranged66.txt,Random Forest,717,2,7,331,0.9914853358561968
numbersArranged67.txt,Random Forest,717,4,4,332,0.9924314096499527
numbersArranged68.txt,Random Forest,710,6,7,334,0.9877010406811731
numbersArranged69.txt,Random Forest,717,3,2,335,0.9952696310312205
numbersArranged70.txt,Random Forest,718,3,7,329,0.9905392620624409

Appendix C

numbersArranged0.txt,SVM,689,10,3,355,0.9877010406811731
numbersArranged1.txt,SVM,699,0,3,355,0.9971617786187322
numbersArranged2.txt,SVM,674,25,1,357,0.9754020813623463
numbersArranged3.txt,SVM,699,0,1,357,0.9990539262062441
numbersArranged4.txt,SVM,693,6,1,357,0.9933774834437086
numbersArranged5.txt,SVM,686,13,1,357,0.9867549668874173
numbersArranged6.txt,SVM,697,3,2,355,0.9952696310312205
numbersArranged7.txt,SVM,699,1,1,356,0.9981078524124882
numbersArranged8.txt,SVM,696,2,4,355,0.9943235572374646
numbersArranged9.txt,SVM,692,6,5,354,0.9895931882686849
numbersArranged10.txt,SVM,700,0,2,355,0.9981078524124882
numbersArranged11.txt,SVM,699,0,1,357,0.9990539262062441
numbersArranged12.txt,SVM,694,4,5,354,0.9914853358561968
numbersArranged13.txt,SVM,699,1,3,354,0.9962157048249763
numbersArranged14.txt,SVM,697,3,3,354,0.9943235572374646
numbersArranged15.txt,SVM,696,3,2,356,0.9952696310312205
numbersArranged16.txt,SVM,689,7,8,353,0.9858088930936613
numbersArranged17.txt,SVM,699,1,2,355,0.9971617786187322
numbersArranged18.txt,SVM,697,0,5,355,0.9952696310312205
numbersArranged19.txt,SVM,699,0,3,355,0.9971617786187322
numbersArranged20.txt,SVM,695,0,6,356,0.9943235572374646
numbersArranged21.txt,SVM,698,2,0,357,0.9981078524124882
numbersArranged22.txt,SVM,700,0,1,356,0.9990539262062441
numbersArranged23.txt,SVM,698,0,3,356,0.9971617786187322
numbersArranged24.txt,SVM,694,5,3,355,0.9924314096499527
numbersArranged25.txt,SVM,699,1,2,355,0.9971617786187322
numbersArranged26.txt,SVM,699,0,3,355,0.9971617786187322
numbersArranged27.txt,SVM,699,0,2,356,0.9981078524124882
numbersArranged28.txt,SVM,698,1,3,355,0.9962157048249763
numbersArranged29.txt,SVM,698,0,3,356,0.9971617786187322
numbersArranged30.txt,SVM,678,19,12,348,0.9706717123935666
numbersArranged31.txt,SVM,694,3,4,356,0.9933774834437086
numbersArranged32.txt,SVM,698,1,2,356,0.9971617786187322
numbersArranged33.txt,SVM,699,0,3,355,0.9971617786187322
numbersArranged34.txt,SVM,694,3,7,353,0.9905392620624409
numbersArranged35.txt,SVM,699,0,2,356,0.9981078524124882
numbersArranged36.txt,SVM,698,2,2,355,0.9962157048249763
numbersArranged37.txt,SVM,698,2,2,355,0.9962157048249763
numbersArranged38.txt,SVM,693,2,9,353,0.9895931882686849
numbersArranged40.txt,SVM,699,0,2,356,0.9981078524124882
numbersArranged41.txt,SVM,699,0,2,356,0.9981078524124882
numbersArranged42.txt,SVM,698,1,3,355,0.9962157048249763
numbersArranged43.txt,SVM,699,1,4,353,0.9952696310312205
numbersArranged44.txt,SVM,696,0,4,357,0.9962157048249763
numbersArranged45.txt,SVM,697,0,4,356,0.9962157048249763
numbersArranged46.txt,SVM,712,8,4,333,0.988647114474929
numbersArranged47.txt,SVM,719,0,2,336,0.9981078524124882
numbersArranged48.txt,SVM,720,0,1,336,0.9990539262062441
numbersArranged49.txt,SVM,716,4,3,334,0.9933774834437086
numbersArranged50.txt,SVM,720,0,1,336,0.9990539262062441
numbersArranged51.txt,SVM,717,2,2,336,0.9962157048249763
numbersArranged52.txt,SVM,717,2,2,336,0.9962157048249763
numbersArranged53.txt,SVM,718,0,3,336,0.9971617786187322
numbersArranged56.txt,SVM,714,5,7,331,0.988647114474929
numbersArranged58.txt,SVM,719,0,2,336,0.9981078524124882
numbersArranged59.txt,SVM,719,1,1,336,0.9981078524124882
numbersArranged60.txt,SVM,705,13,11,328,0.9772942289498581
numbersArranged61.txt,SVM,715,4,2,336,0.9943235572374646
numbersArranged62.txt,SVM,720,0,1,336,0.9990539262062441
numbersArranged63.txt,SVM,700,15,8,334,0.978240302743614
numbersArranged64.txt,SVM,700,19,2,336,0.9801324503311258
numbersArranged65.txt,SVM,718,1,2,336,0.9971617786187322
numbersArranged66.txt,SVM,719,0,4,334,0.9962157048249763
numbersArranged67.txt,SVM,721,0,0,336,1.0
numbersArranged68.txt,SVM,707,9,8,333,0.9839167455061495
numbersArranged69.txt,SVM,719,1,1,336,0.9981078524124882
numbersArranged70.txt,SVM,718,3,3,333,0.9943235572374646
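
The rows in Appendices A–C carry no column header. The following minimal sketch, which is not part of the original experiments, parses three rows copied from the appendices and checks that the final field equals the sum of the first and fourth counts divided by the total of the four counts. Reading the four counts as the cells of a 2 × 2 confusion matrix is an assumption drawn from that arithmetic. The helper names `parse_row` and `recomputed_accuracy` are hypothetical.

```python
import csv
import io

# Three rows copied verbatim from Appendices A-C (one per classifier).
SAMPLE = """\
numbersArranged70.txt,Nearest Neighbors,721,0,34,302,0.967833491012299
numbersArranged0.txt,Random Forest,692,7,2,356,0.9914853358561968
numbersArranged67.txt,SVM,721,0,0,336,1.0
"""


def parse_row(row):
    """Split one appendix row into (file, classifier, counts, accuracy)."""
    name, clf, *cells, acc = row
    return name, clf, [int(c) for c in cells], float(acc)


def recomputed_accuracy(counts):
    # Assumption: the first and fourth counts are the correctly classified
    # samples, i.e., the diagonal of a 2x2 confusion matrix.
    return (counts[0] + counts[3]) / sum(counts)


rows = [parse_row(r) for r in csv.reader(io.StringIO(SAMPLE))]
for name, clf, counts, acc in rows:
    match = abs(recomputed_accuracy(counts) - acc) < 1e-12
    print(f"{clf:17s} {name:22s} n={sum(counts)} match={match}")
```

Under this reading, each of the three sample rows describes a test split of 1057 samples, and the two middle counts are the misclassified samples of each class.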

References

Saxena, G.D.; Dinesh, G.; David, D.S.; Tiwari, M.; Tiwari, T.; Monisha, M.; Chauhan, A. Addressing the Distinct Security Vulnerabilities Typically Emerge on the Mobile Ad-Hoc Network Layer. NeuroQuantology 2023, 21, 169–178. [Google Scholar]
Manju, V. Sybil attack prevention in wireless sensor network. Int. J. Comput. Netw. Wirel. Mob. Commun. (IJCNWMC) 2014, 4, 125–132. [Google Scholar]
Mahesh, B. Machine learning algorithms—A review. Int. J. Sci. Res. (IJSR) 2020, 9, 381–386. [Google Scholar] [CrossRef]
Rahbari, M.; Jamali, M.A.J. Efficient detection of Sybil attack based on cryptography in VANET. arXiv 2011, arXiv:1112.2257. [Google Scholar] [CrossRef]
Chang, W.; Wu, J. A Survey of Sybil Attacks in Networks. Available online: https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=97dd43eabe4789e39b8290cf43daa513483aa4c7 (accessed on 9 September 2024).
Balachandran, N.; Sanyal, S. A review of techniques to mitigate sybil attacks. arXiv 2012, arXiv:1207.2617. [Google Scholar]
Platt, M.; McBurney, P. Sybil in the haystack: A comprehensive review of blockchain consensus mechanisms in search of strong Sybil attack resistance. Algorithms 2023, 16, 34. [Google Scholar] [CrossRef]
Douceur, J.R. The sybil attack. In Peer-to-Peer Systems; Springer: Berlin/Heidelberg, Germany, 2002; pp. 251–260. [Google Scholar]
Tran, D.N.; Min, B.; Li, J.; Subramanian, L. Sybil-Resilient Online Content Voting. In Proceedings of the NSDI’09: 6th USENIX Symposium on Networked Systems Design and Implementation, Boston, MA, USA, 22–24 April 2009; Volume 9, pp. 15–28. [Google Scholar]
Cárdenas-Haro, J.A.; Konjevod, G. Detecting sybil nodes in static and dynamic networks. In Proceedings of the On the Move to Meaningful Internet Systems, OTM 2010: Confederated International Conferences: CoopIS, IS, DOA and ODBASE, Hersonissos, Crete, Greece, 25–29 October 2010; Proceedings, Part II. pp. 894–917. [Google Scholar]
Misra, S.; Tayeen, A.S.M.; Xu, W. SybilExposer: An effective scheme to detect Sybil communities in online social networks. In Proceedings of the 2016 IEEE International Conference on Communications (ICC), Kuala Lumpur, Malaysia, 22–27 May 2016; pp. 1–6. [Google Scholar]
Fong, P.W. Preventing Sybil attacks by privilege attenuation: A design principle for social network systems. In Proceedings of the 2011 IEEE Symposium on Security and Privacy, Oakland, CA, USA, 22–25 May 2011; pp. 263–278. [Google Scholar]
Margolin, N.B.; Levine, B.N. Informant: Detecting sybils using incentives. In Proceedings of the International Conference on Financial Cryptography and Data Security, Scarborough, Trinidad and Tobago, 12–16 February 2007; pp. 192–207. [Google Scholar]
Du, W.; Deng, J.; Han, Y.S.; Varshney, P.K.; Katz, J.; Khalili, A. A pairwise key predistribution scheme for wireless sensor networks. ACM Trans. Inf. Syst. Secur. (TISSEC) 2005, 8, 228–258. [Google Scholar] [CrossRef]
Lu, H.; Gong, D.; Li, Z.; Liu, F.; Liu, F. SybilHP: Sybil Detection in Directed Social Networks with Adaptive Homophily Prediction. Appl. Sci. 2023, 13, 5341. [Google Scholar] [CrossRef]
Patel, S.T.; Mistry, N.H. A review: Sybil attack detection techniques in WSN. In Proceedings of the 2017 4th International Conference on Electronics and Communication Systems (ICECS), Coimbatore, India, 24–25 February 2017; pp. 184–188. [Google Scholar]
Almesaeed, R.; Al-Salem, E. Sybil attack detection scheme based on channel profile and power regulations in wireless sensor networks. Wirel. Netw. 2022, 28, 1361–1374. [Google Scholar] [CrossRef]
Batchelor, B.G. Pattern Recognition: Ideas in Practice; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2012. [Google Scholar]
Kafetzis, D.; Vassilaras, S.; Vardoulias, G.; Koutsopoulos, I. Software-defined networking meets software-defined radio in mobile ad hoc networks: State of the art and future directions. IEEE Access 2022, 10, 9989–10014. [Google Scholar] [CrossRef]
Cui, Z.; Fei, X.; Zhang, S.; Cai, X.; Cao, Y.; Zhang, W.; Chen, J. A hybrid blockchain-based identity authentication scheme for multi-WSN. IEEE Trans. Serv. Comput. 2020, 13, 241–251. [Google Scholar] [CrossRef]
Saraswathi, R.V.; Sree, L.P.; Anuradha, K. Support vector based regression model to detect Sybil attacks in WSN. Int. J. Adv. Trends Comput. Sci. Eng. 2020, 9, 4090–4096. [Google Scholar]
Choy, G.; Khalilzadeh, O.; Michalski, M.; Do, S.; Samir, A.E.; Pianykh, O.S.; Geis, J.R.; Pandharipande, P.V.; Brink, J.A.; Dreyer, K.J. Current applications and future impact of machine learning in radiology. Radiology 2018, 288, 318–328. [Google Scholar] [CrossRef] [PubMed]
Tong, F.; Zhang, Z.; Zhu, Z.; Zhang, Y.; Chen, C. A novel scheme based on coarse-grained localization and fine-grained isolation for defending against Sybil attack in low power and lossy networks. Asian J. Control 2023. [Google Scholar] [CrossRef]
Nayyar, A.; Rameshwar, R.; Solanki, A. Internet of Things (IoT) and the digital business environment: A standpoint inclusive cyber space, cyber crimes, and cybersecurity. In The Evolution of Business in the Cyber Age; Apple Academic Press: New York, NY, USA, 2020; Volume 10, ISBN 9780429276484. [Google Scholar]
Alsafery, W.; Rana, O.; Perera, C. Sensing within smart buildings: A survey. ACM Comput. Surv. 2023, 55, 1–35. [Google Scholar] [CrossRef]
NBC News. NBC News Home. 2023. Available online: https://www.pewresearch.org/internet/fact-sheet/mobile/ (accessed on 28 February 2024).
Vincent, J. Emotion and the mobile phone. In Cultures of Participation: Media Practices, Politics and Literacy; Peter Lang: Lausanne, Switzerland, 2011; pp. 95–109. [Google Scholar]
Xiao, L.; Greenstein, L.J.; Mandayam, N.B.; Trappe, W. Channel-based detection of sybil attacks in wireless networks. IEEE Trans. Inf. Forensics Secur. 2009, 4, 492–503. [Google Scholar] [CrossRef]
Samuel, S.J.; Dhivya, B. An efficient technique to detect and prevent Sybil attacks in social network applications. In Proceedings of the 2015 IEEE International Conference on Electrical, Computer and Communication Technologies (ICECCT), Coimbatore, India, 5–7 March 2015; pp. 1–3. [Google Scholar]
Arif, M.; Wang, G.; Bhuiyan, M.Z.A.; Wang, T.; Chen, J. A survey on security attacks in VANETs: Communication, applications and challenges. Veh. Commun. 2019, 19, 100179. [Google Scholar] [CrossRef]
Iswanto, I.; Tulus, T.; Sihombing, P. Comparison of distance models on K-Nearest Neighbor algorithm in stroke disease detection. Appl. Technol. Comput. Sci. J. 2021, 4, 63–68. [Google Scholar] [CrossRef]
Helmi, Z.; Adriman, R.; Arif, T.Y.; Walidainy, H.; Fitria, M. Sybil Attack Prediction on Vehicle Network Using Deep Learning. J. RESTI (Rekayasa Sist. Dan Teknol. Inf.) 2022, 6, 499–504. [Google Scholar] [CrossRef]
Ben-Hur, A.; Ong, C.S.; Sonnenburg, S.; Schölkopf, B.; Rätsch, G. Support vector machines and kernels for computational biology. PLoS Comput. Biol. 2008, 4, e1000173. [Google Scholar] [CrossRef]
Hu, L.Y.; Huang, M.W.; Ke, S.W.; Tsai, C.F. The distance function effect on k-nearest neighbor classification for medical datasets. SpringerPlus 2016, 5, 1–9. [Google Scholar] [CrossRef] [PubMed]
Lee, G.; Lim, J.; Kim, D.K.; Yang, S.; Yoon, M. An approach to mitigating sybil attack in wireless networks using ZigBee. In Proceedings of the 2008 10th International Conference on Advanced Communication Technology, Phoenix Park, Republic of Korea, 17–20 February 2008; Volume 2, pp. 1005–1009. [Google Scholar]
Eschenauer, L.; Gligor, V.D. A key-management scheme for distributed sensor networks. In Proceedings of the 9th ACM Conference on Computer and Communications Security, Washington, DC, USA, 18–22 November 2002; pp. 41–47. [Google Scholar]
Dhamodharan, U.S.R.K.; Vayanaperumal, R. Detecting and preventing sybil attacks in wireless sensor networks using message authentication and passing method. Sci. World J. 2015, 2015, 841267. [Google Scholar] [CrossRef] [PubMed]
Ammari, A.; Bensalem, A. Fault Tolerance and VANET (Vehicular Ad-Hoc Network). Ph.D. Thesis, University of M’sila, M’sila, Algeria, 2022. [Google Scholar]
Quevedo, C.H.; Quevedo, A.M.; Campos, G.A.; Gomes, R.L.; Celestino, J.; Serhrouchni, A. An intelligent mechanism for sybil attacks detection in vanets. In Proceedings of the ICC 2020—2020 IEEE International Conference on Communications (ICC), Dublin, Ireland, 7–11 June 2020; pp. 1–6. [Google Scholar]
Yu, H.; Gibbons, P.B.; Kaminsky, M.; Xiao, F. Sybillimit: A near-optimal social network defense against sybil attacks. In Proceedings of the 2008 IEEE Symposium on Security and Privacy (sp 2008), Oakland, CA, USA, 18–21 May 2008; pp. 3–17. [Google Scholar]
Abbas, S.; Merabti, M.; Llewellyn-Jones, D.; Kifayat, K. Lightweight sybil attack detection in manets. IEEE Syst. J. 2012, 7, 236–248. [Google Scholar] [CrossRef]
Newsome, J.; Shi, E.; Song, D.; Perrig, A. The sybil attack in sensor networks: Analysis & defenses. In Proceedings of the 3rd International Symposium on Information Processing in Sensor Networks, Berkeley, CA, USA, 26–27 April 2004; pp. 259–268. [Google Scholar]
Chen, Y.; Yang, J.; Trappe, W.; Martin, R.P. Detecting and localizing identity-based attacks in wireless and sensor networks. IEEE Trans. Veh. Technol. 2010, 59, 2418–2434. [Google Scholar] [CrossRef]
Shetty, N.P.; Muniyal, B.; Anand, A.; Kumar, S. An enhanced sybil guard to detect bots in online social networks. J. Cyber Secur. Mobil. 2022, 11, 105–126. [Google Scholar] [CrossRef]
Mounica, M.; Vijayasaraswathi, R.; Vasavi, R. RETRACTED: Detecting Sybil Attack In Wireless Sensor Networks Using Machine Learning Algorithms. IOP Conf. Ser. Mater. Sci. Eng. 2021, 1042, 012029. [Google Scholar] [CrossRef]
Wang, B.; Zhang, L.; Gong, N.Z. SybilSCAR: Sybil detection in online social networks via local rule based propagation. In Proceedings of the IEEE INFOCOM 2017—IEEE Conference on Computer Communications, Atlanta, GA, USA, 1–4 May 2017; pp. 1–9. [Google Scholar] [CrossRef]
Twitter Follower-Followee Graph. Labeled with Benign/Sybil. 2022. Available online: https://figshare.com/articles/dataset/Twitter_follower-followee_graph_labeled_with_benign_Sybil/20057300 (accessed on 9 September 2024).
Demirbas, M.; Song, Y. An RSSI-based scheme for sybil attack detection in wireless sensor networks. In Proceedings of the 2006 International Symposium on a World of Wireless, Mobile and Multimedia Networks (WoWMoM’06), Buffalo, NY, USA, 26–29 June 2006; p. 5. [Google Scholar]
Machine Learning Research. Apple’s Siri Voice Recognition Software. 2023. Available online: https://machinelearning.apple.com/research/hey-siri (accessed on 21 February 2024).
Alexa. Amazon’s Alexa Voice Recognition Software. 2023. Available online: https://developer.amazon.com/ (accessed on 21 February 2024).
Kak, S. A three-stage quantum cryptography protocol. Found. Phys. Lett. 2006, 19, 293–296. [Google Scholar] [CrossRef]
Sahami, M.; Dumais, S.; Heckerman, D.; Horvitz, E. A Bayesian approach to filtering junk e-mail. In Learning for Text Categorization: Papers from the 1998 Workshop; Citeseer: Madison, WI, USA, 1998; Volume 62, pp. 98–105. [Google Scholar]
Schiappa, M.C.; Rawat, Y.S.; Shah, M. Self-supervised learning for videos: A survey. ACM Comput. Surv. 2023, 55, 1–37. [Google Scholar] [CrossRef]
Zamsuri, A.; Defit, S.; Nurcahyo, G.W. Classification of Multiple Emotions in Indonesian Text Using The K-Nearest Neighbor Method. J. Appl. Eng. Technol. Sci. (JAETS) 2023, 4, 1012–1021. [Google Scholar] [CrossRef]
Gupta, M.; Judge, P.; Ammar, M. A reputation system for peer-to-peer networks. In Proceedings of the 13th International Workshop on Network and Operating Systems Support for Digital Audio and Video, Monterey, CA, USA, 1–3 June 2003; pp. 144–152. [Google Scholar]
Michalski, R.S.; Stepp, R.E.; Diday, E. A recent advance in data analysis: Clustering objects into classes characterized by conjunctive concepts. In Progress in Pattern Recognition; Elsevier: Amsterdam, The Netherlands, 1981; pp. 33–56. [Google Scholar]
Medjahed, S.A.; Saadi, T.A.; Benyettou, A. Breast cancer diagnosis by using k-nearest neighbor with different distances and classification rules. Int. J. Comput. Appl. 2013, 62, 1–5. [Google Scholar]
Swamynathan, G.; Almeroth, K.C.; Zhao, B.Y. The design of a reliable reputation system. Electron. Commer. Res. 2010, 10, 239–270. [Google Scholar] [CrossRef]
Valarmathi, M.; Meenakowshalya, A.; Bharathi, A. Robust Sybil attack detection mechanism for Social Networks-a survey. In Proceedings of the 2016 3rd International conference on advanced computing and communication systems (ICACCS), Coimbatore, India, 22–23 January 2016; Volume 1, pp. 1–5. [Google Scholar]
Vasudeva, A.; Sood, M. Survey on sybil attack defense mechanisms in wireless ad hoc networks. J. Netw. Comput. Appl. 2018, 120, 78–118. [Google Scholar] [CrossRef]
Yu, H.; Kaminsky, M.; Gibbons, P.B.; Flaxman, A. Sybilguard: Defending against sybil attacks via social networks. In Proceedings of the 2006 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, Pisa, Italy, 11–15 September 2006; pp. 267–278. [Google Scholar]
Yuan, D.; Miao, Y.; Gong, N.Z.; Yang, Z.; Li, Q.; Song, D.; Wang, Q.; Liang, X. Detecting fake accounts in online social networks at the time of registrations. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, London, UK, 11–15 November 2019; pp. 1423–1438. [Google Scholar]
Zhang, K.; Liang, X.; Lu, R.; Shen, X. Sybil attacks and their defenses in the internet of things. IEEE Internet Things J. 2014, 1, 372–383. [Google Scholar] [CrossRef]
Kaggle. Level Up with the Largest AI & ML Community. 2023. Available online: https://www.kaggle.com/ (accessed on 25 February 2024).
Jain, N.; Jana, P.K. LRF: A logically randomized forest algorithm for classification and regression problems. Expert Syst. Appl. 2023, 213, 119225. [Google Scholar] [CrossRef]
Jethava, G.; Rao, U.P. User behavior-based and graph-based hybrid approach for detection of sybil attack in online social networks. Comput. Electr. Eng. 2022, 99, 107753. [Google Scholar] [CrossRef]

Figure 1. Sybil Attack on a P2P network.

Figure 2. RSS for legitimate (A) and Sybil (B) nodes, respectively.

Figure 3. Portion of adopted dataset.

Figure 4. An example of a matrix generated by our algorithm.

Figure 5. The hyperplane for SVM. Colors and shapes represent different item categories.

Figure 6. Two- and three-dimensional hyperplanes. Colors represent different item categories.

Figure 7. Three different sample subsets (A–C) are generated from the main dataset by bootstrapping to test against overfitting.

Figure 8. User ID number 5 and the user ID numbers that it follows.

Figure 9. Output from three ML algorithms on matrix number 47.

Figure 10. KNN accuracy values.

Figure 11. Line chart for KNN accuracy values with trends.

Figure 12. Random forest accuracy values.

Figure 13. SVM accuracy value line chart.

Figure 14. Accuracy values for SVM and random forest.

Figure 15. Accuracy values for KNN and support vector machine.

Figure 16. Accuracy values for KNN and random forest.

Figure 17. Accuracy values for KNN, SVM, and random forest.

Figure 18. Normal distribution graph for KNN algorithm.

Figure 19. Normal distribution graph for support vector machine algorithm.

Figure 20. Normal distribution graph for random forest algorithm.

Table 1. Mean value and standard deviation comparison.

Metric	K-Nearest Neighbor	Support Vector Machine	Random Forest
Mean accuracy (%)	96.759	99.394	99.213
Std. dev. (accuracy as a fraction)	0.0038	0.0061	0.0029
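
The two rows of Table 1 appear to use different scales: the mean is given in percent, whereas the standard deviation is on the 0–1 scale of the accuracy column in the appendices. The following minimal sketch illustrates this using only the five SVM accuracies for matrices 66–70 from Appendix C, so its output will not reproduce the table values, which summarize all matrices. Whether the paper used the sample or the population standard deviation is not stated; the sample form (`statistics.stdev`) is assumed here.

```python
import statistics

# SVM accuracies for matrices 66-70, copied from Appendix C.
svm_tail = [
    0.9962157048249763,
    1.0,
    0.9839167455061495,
    0.9981078524124882,
    0.9943235572374646,
]

# Mean expressed in percent, matching the scale of the mean row in Table 1.
mean_pct = 100 * statistics.mean(svm_tail)

# Sample standard deviation kept as a fraction, matching the scale of the
# standard deviation row in Table 1 (assumption: sample, not population).
std_frac = statistics.stdev(svm_tail)

print(f"mean = {mean_pct:.3f} %, sample std = {std_frac:.4f}")
```

Multiplying `std_frac` by 100 would put both rows on the same percent scale.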

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

