1. Introduction
Blockchain was first proposed in 1991 as a cryptographically secured system for exchanging information, intended to address data security concerns [1]. Bitcoin, the first electronic cryptocurrency, was built on Blockchain features by Satoshi Nakamoto in 2008 [2] and attracted the attention of governments around the world. The popularity of Bitcoin as a cryptocurrency has in turn made Blockchain popular. Blockchain has gained many enthusiasts in industry and academia and has been adopted in many application areas, such as the Internet of Things [3]. Owing to this popularity, and to the anonymity offered to users, many cybercriminals and even real-world criminals became interested in using Blockchain and Bitcoin [4].
However, blockchains are not without drawbacks and limitations and are not completely immune to fraud, hacks, attacks, and other malicious activities. The blockchain itself suffers from security issues, which can be categorized into three levels: the process level, the data level, and the infrastructure level. Many studies have examined how to incorporate different Blockchain technologies to enhance the security, transparency, and traceability of systems [5].
Bitcoin users are always at risk of being hacked. In addition to the enormous economic losses this causes users, it can also cause credit crises for commercial websites [6,7,8,9]. Because the technology is new, security mechanisms for some systems do not yet exist, and there have been several attacks on digital currencies [6]. Although Blockchain technology helps prevent fraudulent behavior, it cannot detect fraud on its own, so new techniques and methods are needed to track attacks [10]. The attractiveness of Bitcoin, on the one hand, and the rise of cybercrime activity, on the other, have made it imperative to use anomaly detection for identifying potential scams.
One of the most important techniques for handling security issues is anomaly detection. In data mining, anomaly detection is the identification of rare items, events, or observations that raise suspicion by differing significantly from the majority of the data. A collective anomaly is a group of data points that, taken together, differ from the majority of the data, even though no single point in the group would be treated as an anomaly on its own [11]. Although the structure of Blockchain and its consensus algorithms help prevent fraudulent behavior, Blockchain cannot detect fraud on its own, and there may always be unforeseen ways to steal and defraud [10]. Thus, new techniques and methods are being developed for handling attacks on Blockchain.
This study applies collective anomaly detection (instead of point anomaly detection) to all of a user's wallets together (instead of to individual wallets), which makes it possible to drop features that are computationally and operationally expensive to extract. This approach reduces the data size and helps to better identify anomalies that have been deliberately spread across multiple wallets of the same user.
2. Previous Works
Much research has been carried out on cryptocurrency. For example, owing to the attractiveness of cryptocurrency, many studies have addressed its financial aspects, such as [12,13,14,15]. The main focus of the current study is on anomalies in cryptocurrencies and in their underlying architecture, namely Blockchain.
Initially, Blockchain was thought to be resistant to all kinds of attacks owing to its cryptographic design and its consensus algorithms, but security issues have prompted researchers to look for ways to detect anomalies in blockchains. Several studies have tackled anomaly detection in Blockchain [4,7,8,9,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33].
Table 1 summarizes the types of malicious attacks on blockchains and the tactics and potential strategies that can be used to confront them. As shown in Table 1, anomaly detection methods can be used to detect most of these malicious attacks. For example, anomaly detection methods can detect and track the bitcoin accounts of users who have used combinational services to engage in illegal activities or money laundering.
According to Table 1, one use of anomaly detection in blockchains is countering the Record Hacking attack and detecting theft, hacking, and fraud, which is the problem considered by this study and by the following works:
Zambre et al. [9] used six features and the K-Means algorithm to identify suspicious and rogue users, providing a starting point for analyzing suspicious users. Pham et al. [7] used three main methods from social network analysis (power degree and densification laws, K-Means clustering, and local outlier factor) to detect anomalies, and were able to discover one of the 30 known cases of theft. In a subsequent study, Pham et al. [21] used three unsupervised learning methods, namely K-Means clustering, Mahalanobis distance, and an unsupervised support vector machine (SVM), and were able to identify a total of 3 of the 30 known cases.
Monamo et al. [4] emphasize that anomaly detection plays an important role in data mining, because many outlying observations carry important information for further investigation, and that in the Bitcoin network detecting anomalies amounts to detecting fraud. Using the trimmed K-means method, they successfully identified five of the addresses involved in the 30 cases of theft, hacking, fraud, or loss.
In their next study [20], Monamo et al. used the kd-trees algorithm instead of the trimmed K-means algorithm and were able to discover 7 of the target addresses, which were involved in 5 of the 30 cases of theft, hacking, fraud, or loss. (In many cases of theft, hacking, or fraud, the thieves used multiple addresses and wallets to make anomalies difficult to detect, so each theft leaves multiple addresses and wallets behind.)
Signorini et al. [22] suggested exploiting forks, instead of discarding them, to detect anomalies. Chawathe [16] further analyzes the method of Monamo et al. [4] and recommends it for detecting anomalies in blockchains. In addition to the methods and algorithms used in the previous works, feature selection is also very important; the features used are compared in Table 2. A ✓ indicates that the method in that column uses the feature in that row.
The previous works improved anomaly detection mostly by changing or adding features or by changing the algorithm, but all of them sought anomalies only at the level of individual wallet addresses. If a user has multiple wallets and the behavior of each address seems normal, these methods become less effective. Since anomalous users mainly use multiple wallet addresses to make their behavior look normal, it can be more efficient to examine the behavior of the user rather than that of the wallet address. To address this problem, the present study combines the best features and algorithms of the previous works with collective anomaly detection, so that more attention is paid to detecting anomalous users with several wallet addresses.
3. Research Method
As mentioned, anomaly detection in this research is carried out with a collective anomaly approach. The details of the proposed method are described below.
3.1. Dataset and Theft List
In this research, the dataset of the “ELTE Bitcoin Project” [34] has been used. This database includes the entire Bitcoin blockchain until 9 February 2016, and its basic version includes transactions until 28 December 2013. The basic database consists of seven files, among them: Block specification, Transaction ID, Bitcoin Addresses, Block ID, Transaction output list, and Transaction input list. Each file has several features. Because previous works examined the Bitcoin Blockchain database up to 7 April 2013, this study also uses the data up to that date so that its results can be compared more closely with previous works. The list of addresses that have committed theft, fraud, hacking, or loss was then extracted.
The following list provides a brief description of the features in Table 2:
In-degree: Number of transactions received by a given user.
Out-degree: Number of transactions sent by a given user.
Unique in-degree: Number of unique users from whom a given user has received transactions.
Unique out-degree: Number of unique users to whom a given user has sent transactions.
Average in-transaction: Average number of bitcoins received per incoming transaction.
Average out-transaction: Average number of bitcoins sent per outgoing transaction.
Average time interval between in-transactions.
Average time interval between out-transactions.
Number of public keys owned by a given user.
Balance: Net number of bitcoins retained by the user.
Clustering coefficient: the measure of connectivity amongst neighbors of a given user.
Creation date: timestamp of the first transaction associated with a given user.
Active duration: time difference between first and most recent transactions associated with a given user.
Balance: Net number of bitcoins for a given transaction considering all in- and outgoing edges from that transaction.
Clustering coefficient: the measure of connectivity amongst neighbors of a given transaction.
Currency features: total amount sent, total amount received, average amount sent, average amount received, standard deviation received, standard deviation sent.
Creation date: timestamp of the first edge associated with a given transaction.
Active duration: time difference between first and most recent edges associated with a given transaction.
Network/graph features: in-degree, out-degree, clustering coefficient, number of triangles.
Average neighborhood (source–target): for each query node, the source is the origin of an incoming transaction and the target is the destination of an outgoing one. Four features are identified: in–in, in–out, out–out, out–in.
Average amount incoming: The average amount of bitcoins received by the user's wallet addresses.
Average amount outgoing: The average amount of bitcoins sent from the user's wallet addresses.
Total amount sent: The total amount of bitcoins sent from the user's wallet addresses.
Total amount received: The total amount of bitcoins received by the user's wallet addresses.
Standard deviation received: The standard deviation of the amounts of bitcoins received by the user's wallet addresses.
Standard deviation sent: The standard deviation of the amounts of bitcoins sent from the user's wallet addresses.
Average neighborhood (In-in): The average neighborhood of inputs to inputs of all outputs.
Average neighborhood (In-out): The average neighborhood of inputs to outputs of all outputs.
Average neighborhood (Out-in): The average neighborhood of outputs to inputs of all outputs.
Average neighborhood (Out-out): The average neighborhood of outputs to outputs of all outputs.
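The degree and currency features described above can be illustrated with a minimal sketch. It assumes a simplified input that is not the format of the ELTE dataset: a list of hypothetical (sender, receiver, amount) tuples in which each tuple is one transfer between users. The function name, the dictionary keys, and the use of the population standard deviation are assumptions made for the illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev


def user_features(transactions):
    """Compute per-user degree and currency features.

    transactions: iterable of (sender, receiver, amount) tuples (hypothetical format).
    """
    received = defaultdict(list)   # user -> amounts received
    sent = defaultdict(list)       # user -> amounts sent
    senders = defaultdict(set)     # user -> users who sent to this user
    receivers = defaultdict(set)   # user -> users this user sent to
    for sender, receiver, amount in transactions:
        sent[sender].append(amount)
        received[receiver].append(amount)
        receivers[sender].add(receiver)
        senders[receiver].add(sender)

    features = {}
    for user in set(sent) | set(received):
        ins = received.get(user, [])
        outs = sent.get(user, [])
        features[user] = {
            "in_degree": len(ins),
            "out_degree": len(outs),
            "unique_in_degree": len(senders.get(user, ())),
            "unique_out_degree": len(receivers.get(user, ())),
            "total_received": sum(ins),
            "total_sent": sum(outs),
            "avg_in": mean(ins) if ins else 0.0,
            "avg_out": mean(outs) if outs else 0.0,
            # Population standard deviation; the paper does not say which variant it uses.
            "std_received": pstdev(ins) if ins else 0.0,
            "std_sent": pstdev(outs) if outs else 0.0,
            "balance": sum(ins) - sum(outs),
        }
    return features
```

Graph features such as the clustering coefficient and the average neighborhood need the full transaction graph and are left out of this sketch.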
3.2. Preprocessing
Since the best results in previous works are those of Monamo et al. [20], who used the 14 features listed in Table 2, this study started from these features. Data preprocessing was conducted in the following three general steps:
Data cleaning: Records that have no input or output are removed. Consequently, the number of records is reduced from 13,086,527 to 10,800,406.
Data aggregation and data size reduction: To detect collective anomalies, all the addresses and wallets of a user are aggregated using the Contraction feature, and the features are then extracted from the aggregated data. As a result, the number of records drops to 5,305,678.
Because there is a computational relationship between the in-degree and out-degree features, and in keeping with the principle of data aggregation, both features can be eliminated relative to the method of Monamo et al. [20].
In many cases of theft, hacking, or fraud, the criminals work with multiple addresses and wallets to make the anomaly difficult to detect. The previous works, which used point anomaly detection, therefore relied on two important features, the clustering coefficient and the number of triangles, to capture the multiplicity of connections between these addresses and wallets. In contrast, this research detects collective anomalies of users (each possibly holding multiple addresses and wallets) instead of point anomalies of addresses and wallets, so the clustering coefficient and triangle features, whose extraction requires high computational and operational power, can be removed.
Data conversion: The min–max linear method was used to normalize the data.
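The aggregation and normalization steps above can be sketched as follows. This is an illustration under stated assumptions, not the authors' MATLAB code: `address_to_user` is a hypothetical mapping standing in for the Contraction feature, the per-address rows are assumed to hold only additive quantities (counts and totals), and min–max normalization is taken in its usual form, x' = (x − min) / (max − min).

```python
from collections import defaultdict


def aggregate_by_user(address_rows, address_to_user):
    """Sum additive per-address features into one row per user.

    address_rows: dict address -> dict of additive features (hypothetical format).
    address_to_user: dict address -> user id (stand-in for the Contraction feature).
    """
    users = defaultdict(lambda: defaultdict(float))
    for address, row in address_rows.items():
        # An address without a known owner is treated as its own user (assumption).
        user = address_to_user.get(address, address)
        for name, value in row.items():
            users[user][name] += value
    return {user: dict(row) for user, row in users.items()}


def min_max_normalize(rows):
    """Rescale every feature linearly to [0, 1] across all rows."""
    names = sorted({name for row in rows.values() for name in row})
    lo = {n: min(row.get(n, 0.0) for row in rows.values()) for n in names}
    hi = {n: max(row.get(n, 0.0) for row in rows.values()) for n in names}
    normalized = {}
    for key, row in rows.items():
        normalized[key] = {
            # A constant feature carries no information and is mapped to 0.
            n: (row.get(n, 0.0) - lo[n]) / (hi[n] - lo[n]) if hi[n] > lo[n] else 0.0
            for n in names
        }
    return normalized
```

Only additive features can be summed this way; averages and standard deviations would have to be recomputed from the aggregated transactions rather than added up.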
3.3. Feature Extraction
Table 3 shows the features employed by the proposed approach, along with a brief description of each.
3.4. Trimmed K-Means Algorithm
Clustering is a well-known technique for anomaly detection because it tends to place outlier data in a separate cluster. Among the clustering algorithms, K-means is one of the most popular. Although some authors, such as [4], hold that K-means is not itself a technique for outlier detection, it provides a basis for evaluating methods, given that outliers will be found farthest from the centroids of the clusters they are associated with. Moreover, K-means inherits the lack of robustness of the mean. Instead of K-means, some researchers have suggested an extended version of this algorithm called trimmed K-means. Trimmed K-means is based on partial trimming and is more robust than classical K-means clustering [35]. The general approach of trimmed K-means is as follows:
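The input value α, a number between 0 and 1, specifies the fraction of the data to be treated as outliers, and K is the number of clusters. A penalty function is denoted by Φ. For each set A such that P(A) ≥ 1 − α and any K-set M = {m1, m2, …, mK} in a vector space with d dimensions, the variation of M given A is, in the standard formulation of trimmed K-means [35]:

$$V_{\Phi}^{A}(M) = \frac{1}{P(A)} \int_{A} \Phi\left( \min_{i=1,\dots,K} \lVert x - m_i \rVert \right) \, dP(x)$$

To obtain the variation for K clusters, this quantity is minimized over M:

$$V_{K,\Phi}^{A} = \inf_{M \subset \mathbb{R}^{d},\ |M| = K} V_{\Phi}^{A}(M)$$

To obtain cluster 0, which holds an α fraction of the dataset, it is minimized in turn over A:

$$V_{K,\Phi,\alpha} = \inf_{A:\ P(A) \ge 1-\alpha} V_{K,\Phi}^{A}$$

The main purpose of the algorithm is to obtain a trimmed set A0, whose complement holds the outlier data, and a K-set M0 of cluster centers that fit inside each cluster, satisfying the following condition:

$$V_{\Phi}^{A_0}(M_0) = V_{K,\Phi,\alpha}$$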
Briefly, in trimmed K-means the centers of the clusters are determined from only a (1 − α) fraction of the samples, so that, by selecting a subset of the data, the centers can be determined with reasonable accuracy. One of the most important features of this algorithm is that it places the α fraction of the data lying farthest from the cluster centers in cluster 0. This feature is particularly important for the problem considered here, anomaly detection.
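The trimming idea can be sketched in a few lines of Python. This is a minimal illustration, not the FSDA implementation used in the experiments. It assumes the squared Euclidean distance as the penalty, and it assumes that the caller supplies the initial centers (practical implementations typically try several random starts). The function name and its parameters are hypothetical.

```python
import math


def trimmed_kmeans(points, init_centers, alpha, n_iter=100):
    """Return (centers, labels); label 0 marks trimmed points, 1..K the clusters.

    points: list of equal-length numeric tuples.
    init_centers: list of K starting centers (supplied by the caller, an assumption).
    alpha: fraction of points to trim as outliers, between 0 and 1.
    """
    centers = [tuple(c) for c in init_centers]
    k = len(centers)
    n_trim = int(math.floor(alpha * len(points)))
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Squared distance from every point to its nearest center.
        nearest = []
        for p in points:
            d2 = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            j = min(range(k), key=d2.__getitem__)
            nearest.append((d2[j], j))
        # Trim the n_trim points that lie farthest from their nearest center.
        order = sorted(range(len(points)), key=lambda i: nearest[i][0])
        kept = order[: len(points) - n_trim]
        new_labels = [0] * len(points)
        for i in kept:
            new_labels[i] = nearest[i][1] + 1
        # Recompute each center as the mean of its retained points only.
        for j in range(k):
            members = [points[i] for i in kept if new_labels[i] == j + 1]
            if members:
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
        if new_labels == labels:
            break
        labels = new_labels
    return centers, labels
```

Because trimmed points never contribute to a center, a distant record cannot pull a centroid toward itself, which is the robustness property that classical K-means lacks. With α = 0.01, cluster 0 would hold one percent of the records.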
Because of the large number of records and data dimensions, and to reduce clustering time, Monamo et al. [4] applied the clustering operation to one million records. In the proposed method, thanks to data aggregation and size reduction, the experiment was conducted on all the records to obtain more reliable results.
4. Findings
In this section, the results of the experiment are presented and compared with previous works. The proposed approach was run on a DL380 G9 VPS server with 16 CPU cores and 16 GB of RAM. MATLAB was used for the implementation, and the algorithms were developed with the MATLAB FSDA toolbox [36].
4.1. Experimental Results
As shown in Table 4, the proposed method is the first among the compared works to use collective anomaly detection. It succeeds in detecting anomalous users who try to make their behavior look normal, usually by holding multiple wallet addresses: the proposed method detected 14 users with 26 addresses involved in 9 cases of theft, fraud, hacking, or loss.
As shown in Figure 1, the detected anomalous addresses are all in cluster 0, i.e., the outlier cluster, which makes up exactly one percent of the total data.
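The way such counts of users, addresses, and cases can be derived from cluster 0 is sketched below. The data structures and names are hypothetical: `flagged_users` stands for the users assigned to cluster 0, `user_addresses` for the aggregation of addresses per user, and `theft_cases` for the extracted theft list. The sketch only illustrates the bookkeeping and does not reproduce the reported numbers.

```python
def evaluate_detection(flagged_users, user_addresses, theft_cases):
    """Count detected users, addresses, and cases.

    flagged_users: set of user ids placed in cluster 0.
    user_addresses: dict user id -> set of wallet addresses.
    theft_cases: dict case name -> set of known perpetrator addresses.
    """
    # All addresses owned by flagged users.
    flagged_addresses = set()
    for user in flagged_users:
        flagged_addresses |= user_addresses.get(user, set())
    # All addresses that appear in any known case.
    known = set().union(*theft_cases.values())
    hit_addresses = flagged_addresses & known
    hit_users = {u for u in flagged_users if user_addresses.get(u, set()) & known}
    hit_cases = {case for case, addrs in theft_cases.items() if addrs & hit_addresses}
    return len(hit_users), len(hit_addresses), len(hit_cases)
```

A case counts as detected here as soon as one of its known addresses belongs to a flagged user, which is why the number of detected addresses can exceed the number of detected cases.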
4.2. Comparison with Previous Works
The following section presents comparisons of the current study with the previous works. The comparisons are based on features, employed algorithms, and the performance of the studies.
4.2.1. Comparison of Features
As shown in Table 5, the proposed method is in the middle of the table in terms of the number of extracted features. At the same time, the proposed method does not use the clustering coefficient feature, whose extraction has a high time complexity; therefore, feature extraction in the proposed method is acceptable in terms of computational and processing power.
4.2.2. Comparison of the Used Algorithms
As shown in Table 6, the proposed method was able to detect anomalies using only a single algorithm, which indicates that the choice of algorithm was appropriate.
4.2.3. Comparison of Success of the Suggested Approach
As shown in Table 7 and Figure 2, the proposed method identified 26 of the anomalous addresses that were present in the nine detected anomalies, and in this respect, performed better than the previous works.
In terms of the number of features, the lowest number is that of Pham et al. [21], and the proposed method is in the middle of the comparison table. Because the proposed method detects collective anomalies, the number of records is also reduced, so that overall it succeeds in reducing the dimensions of the problem (number of records and number of features).
In terms of time and of operational and computational power, the proposed method performed better than the previous works that managed to detect anomalies.
In terms of the number of algorithms used to detect anomalies and suspicious transactions, the proposed method performs adequately with the trimmed K-means algorithm alone.
The most important part of comparing the proposed method with the others is its performance and results. In this regard, the proposed method achieved the best performance among the compared methods: it detected 14 users with 26 addresses (wallets) who committed 9 cases of theft, fraud, hacking, or loss. This is much better than the latest method of Monamo et al. [20], which found 7 addresses (wallets) involved in 5 cases of theft, fraud, hacking, or loss.
Figure 3 shows how many frauds were detected by each method and how many wallets were involved in those frauds. Note that the number of detected users is available only for the proposed method, owing to its new user-centered approach to anomaly detection.
The most important novelty of this research compared with previous methods is the use of collective anomaly detection (at the user level) instead of individual anomaly detection (at the wallet address level). That is, instead of seeking anomalies in digital wallet addresses, it seeks anomalies in the behavior of users, who mainly use several digital wallet addresses. The advantages of the suggested approach are:
5. Conclusions and Suggestion
The results show that people who intend to commit fraud and other malicious activities in the Bitcoin network use several addresses, so-called digital wallets, to make their activities resemble those of normal users. In effect, spreading their activity over multiple addresses makes these users look almost like normal users. To detect this type of anomaly, a so-called in-disguise anomaly, one must find a small deviation in these users' behavior. In the previous works, anomaly detection was carried out by extracting new features that rely on the connections between a user's digital wallets. In the proposed method, however, the user's digital wallets are aggregated using collective anomaly detection, and instead of the anomaly of a wallet address, the anomaly of users who own one or more digital wallets is examined.
On the other hand, because of the significant expansion of this network, it becomes very difficult to extract features that require high computing power or time, and in practice it seems very difficult to detect anomalies in this network with such methods. Therefore, to reduce the dimensions of the anomaly detection problem in Blockchain and Bitcoin networks, four features were removed, two of which require high processing and computing power. The proposed method also uses the trimmed K-means algorithm for clustering, which is more robust for anomaly detection problems than similar algorithms such as K-means. In the end, the proposed method was able to identify 14 users who held 26 known anomalous addresses. Thus, compared with the previous methods, the dimensions of the problem were reduced from 10,800,406 records to 5,305,678 and from 14 features to 10, and the processing power and computation time needed for feature extraction were also reduced. In the most important part of the evaluation, the performance results, the number of detected thefts increased from five to nine compared with the best previous methods, and the number of detected addresses of the perpetrators increased from 7 to 26. Additionally, for the first time, the 14 users who committed these cases were identified.
Future work is suggested in two general directions. First, features and algorithms should be selected that require low computational and operational power to extract and execute. Second, the features and algorithms that detect anomalies best should be identified.