1 Introduction

The username and password combination is one of the primary methods of authentication in most organizational portals. Many methods, such as the man-in-the-middle attack [3], DNS spoofing [6], and phishing attacks [16], are used to obtain username and password combinations. All of these are examples of penetration attacks, as they allow an attacker to intercept the connection and make users believe that they are on the right website [1]. In the aforementioned approaches, the user is fooled into giving up their access credentials. Here, we analyze another type of attack, known as a brute force attack. In this approach, the attacker attempts to guess the username and password with the help of tools that draw on dictionaries of username and password combinations. This approach increases the load on the server, which in turn can block legitimate users from logging in; this is an example of a denial of service attack. When such an attack is distributed, it is an example of a distributed denial of service attack [5, 7, 13].

In this paper, we make use of the Kippo honeypot [4, 10], which logs brute force attacks and helps us understand the behavior patterns of the hacker. The hacker attempts to gain access through a Secure Shell (SSH) session. We use data obtained from a honeypot deployed within the Information Security Lab of BITS Pilani, Hyderabad Campus [14, 17].

SSH sessions are targeted primarily because a significant number of servers are not well maintained and often use weak credentials, which makes them a perfect target for malicious agents [12]. A preliminary analysis of usernames and passwords on SSH remote login servers from securehoney.net gave the following results (Table 1):

Table 1. Most common SSH usernames and passwords.

The primary motive of our research is to find out how login credential data propagates [15] once a hacker has successfully obtained access to an SSH server. Figure 1 shows how successful attacks on the honeypot tend to be clustered around certain locations [9].

Fig. 1.

From the above heatmap, we can see that most of the successful attacks seem to stem from North America, Europe, and Southeast Asia.

We also show a zoomed-in view of China, from which the majority of the attacks originated. As the image shows, the attacks appear in pockets, which lends some preliminary support to the hypothesis that credential data spreads in the vicinity of the original successful attempt. In the remainder of the paper, we make use of a variety of clustering methods to catch patterns that may escape the human eye (Fig. 2).

Fig. 2. Distribution of login attempts from China.

2 Related Work

Babak Nabiyev, in his work on the application of clustering techniques for the detection of DDoS attacks, made use of the KDD CUP 99 dataset, which had been developed by DARPA. He attempted to differentiate between normal traffic and DDoS traffic with the help of K-Means and EM clustering techniques. He grouped six cases of DoS attacks together as a single type and defined normal traffic flow as the other type of behavior. Consequently, he made use of these two classes for the final clustering analysis [8].

Shi Zhong also made use of different clustering techniques for intrusion detection, with the DARPA intrusion detection project as his dataset. He carried out a comparative study of different clustering algorithms for intrusion detection, in which he concluded that unsupervised clustering algorithms performed better than supervised learning methods. Out of all the clustering algorithms, his proposed self-labeling heuristic performed the best, with an overall accuracy of 93.6% [19].

Nikolskaia Kseniia analyzed IP traffic with the help of clustering on IP packet headers. The study considered multiple parameters, such as the classification parameters based on packet and transmission properties, the choice of clustering method, and the number of clusters. It concluded that real-time data is too complex to dynamically change features or clustering algorithms. A hybrid neural network approach showed the best results, with about 95% correctness [11].

Jie Wang argues that clustering algorithms may not work well for intrusion detection because the similarity level of data points cannot be controlled. He proposes a two-seed expanding algorithm that splits the attacks into different phases. The preprocessing includes creating a network flow and changing continuous-valued features to binary features. Based on these features, the algorithm selects seeds until all flows are divided into clusters. The reported experiments show that the two-seed expanding algorithm performs better than k-means and other clustering methods [18].

Geoff Boeing used k-means clustering and DBSCAN to cluster 1759 points of latitude and longitude data, reducing them to 138 points (92% compression) without losing the key features of the information that had been spatially represented within the dataset [2].

3 Research Framework

Experimental Setup. We have deployed honeypots in the distributed architecture shown in Fig. 3.

Fig. 3. The honeypot architecture which was used for the DDoS attack.

The hypervisor runs five virtual machines, each of which runs mini-Ubuntu 16.04. Each instance, in turn, runs a different honeypot. Traffic to the virtual machines is controlled with the help of a firewall, and Network Address Translation (NAT) allows them to communicate with the outside world. The server runs within the Information Security Laboratory of the BITS Pilani, Hyderabad Campus network. The server continuously monitors the activity that occurs on the public IP addresses (Table 2).

Table 2. Specifications of the honeypot used | Kippo.

4 Analysis

4.1 Attackers Origin

The origin of the attacker refers to the country or city from which the attack is initiated. The source IP address helps determine the location of the attacker. We made use of the urllib2 library to find the location of the attackers. However, IP addresses are not useful if the attacker makes use of a VPN or the Tor network. The results of the analysis are given in Table 3:
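As an illustration of the lookup step, the following is a minimal sketch in Python 3, where `urllib.request` takes the place of `urllib2`. The endpoint `GEO_URL` and the reply fields (`city`, `country`, `lat`, `lon`) are hypothetical placeholders and not the service used in this work; the `fetch` parameter is likewise an assumption, added so that the function can be exercised without network access.

```python
import json
from urllib.request import urlopen

# Hypothetical endpoint: substitute the geolocation service actually used.
GEO_URL = "https://geo.example.invalid/json/{ip}"


def fetch_json(url, timeout=5):
    """Fetch a URL and decode its JSON body (needs network access)."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def locate(ip, fetch=fetch_json):
    """Return (city, country, latitude, longitude) for an IP address.

    The field names are assumptions about the service's reply format.
    """
    record = fetch(GEO_URL.format(ip=ip))
    return (record.get("city"), record.get("country"),
            record.get("lat"), record.get("lon"))
```

Passing a stub as `fetch` returns canned records, which is convenient when the same addresses are looked up repeatedly or when the analysis is rerun offline.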

Table 3. Successful attempts by city and country.
Fig. 4. Attempts distribution over 6 months

Table 4. Most popular passwords and number of attempts
Table 5. Most popular passwords and number of attempts

We observe clusters of activity followed by patches of inactivity, as seen in Fig. 4. There are spikes of activity in the second week and the last week of June, the second week of July, and the end of October and the beginning of November. On the other hand, very few attacks were initiated in the months of August and September, and hence those months were not included in the graph.

4.2 Traffic Analysis

We segmented the data into files of 1 MB each, for a total of 250 MB of data. The configuration allowed at most 21 attempts from a particular IP before the IP was banned. In total, 870 usernames and 9027 unique passwords were attempted.
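Counts of this kind can be tallied directly from the honeypot log. The sketch below assumes Kippo-style log lines of the form `... HoneyPotTransport,<session>,<ip>] login attempt [<user>/<password>] failed`; the exact layout may differ between versions, and the sample lines and the helper names `tally` and `ips_at_limit` are illustrative only.

```python
import re
from collections import Counter

# Assumed Kippo-style log line; adjust the pattern to the actual log layout.
LOGIN_RE = re.compile(
    r"HoneyPotTransport,\d+,(?P<ip>[\d.]+)\] "
    r"login attempt \[(?P<user>[^/]*)/(?P<password>.*)\] "
    r"(?P<result>succeeded|failed)"
)


def tally(lines):
    """Count attempted usernames, passwords, and attempts per source IP."""
    users, passwords, per_ip = Counter(), Counter(), Counter()
    for line in lines:
        match = LOGIN_RE.search(line)
        if match is None:
            continue  # not a login attempt line
        users[match["user"]] += 1
        passwords[match["password"]] += 1
        per_ip[match["ip"]] += 1
    return users, passwords, per_ip


def ips_at_limit(per_ip, limit=21):
    """List the IPs that have reached the configured attempt limit."""
    return [ip for ip, count in per_ip.items() if count >= limit]


# Illustrative lines using documentation-range IP addresses.
sample = [
    "2018-06-26 10:00:01+0000 [SSHService ssh-userauth on "
    "HoneyPotTransport,7,198.51.100.4] login attempt [root/admin] failed",
    "2018-06-26 10:00:03+0000 [SSHService ssh-userauth on "
    "HoneyPotTransport,7,198.51.100.4] login attempt [root/1234] failed",
    "2018-06-26 10:02:10+0000 [SSHService ssh-userauth on "
    "HoneyPotTransport,8,198.51.100.9] login attempt [admin/admin] succeeded",
]
users, passwords, per_ip = tally(sample)
```

Here `len(users)` and `len(passwords)` give the number of distinct usernames and passwords, and `users.most_common(1)` gives the most attempted username.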

The most attempted username was “root” and the most attempted password was “admin”. In addition to the popular combination of “root” and “admin”, the attackers tried other popular default passwords such as ubnt (as we made use of the Ubuntu operating system) as well as 1234, support and password. Furthermore, the hackers also made use of popular usernames such as admin, user and guest. This analysis shows that something as simple as setting a strong username and password combination can reduce the number of successful security breaches. Finally, we observe that an overwhelming majority of attacks on the distributed honeypot system appear to come from China (Tables 4, 5 and 6).

Table 6. Two days of interactions for Ho Chi Minh City, Vietnam on 26th June and 27th June, 2018 | Obtained by 2 g clustering approach
Fig. 5. Mean shift clustering (a) 1 g (b) 2 g (c) 3 g

4.3 Machine Learning Analysis

On this data, we have made use of three clustering methods, which have helped us gain insight into the attacker’s profile after access to the system has been obtained. We have pooled the data in a manner similar to that used in the n-gram models of natural language processing. Thus, the data comes in three forms, referred to below as 1 g, 2 g and 3 g respectively:

  • Single day data

  • Two days at a time

  • Three days at a time
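The pooling above can be sketched as a sliding window over calendar days. This is only an illustration: the function name `pool_by_days` is hypothetical, and the use of overlapping windows (as n-grams overlap in a token sequence) is an assumption, since the exact windowing is not specified here.

```python
from collections import defaultdict
from datetime import date, timedelta


def pool_by_days(events, n):
    """Group events into overlapping windows of n consecutive days.

    events: iterable of (day, payload) pairs, where day is a datetime.date.
    Returns {first_day_of_window: [payload, ...]}.
    """
    by_day = defaultdict(list)
    for day, payload in events:
        by_day[day].append(payload)
    if not by_day:
        return {}
    first, last = min(by_day), max(by_day)
    windows = {}
    start = first
    # Slide the window one day at a time until it no longer fits.
    while start + timedelta(days=n - 1) <= last:
        pooled = []
        for offset in range(n):
            pooled.extend(by_day.get(start + timedelta(days=offset), []))
        windows[start] = pooled
        start += timedelta(days=1)
    return windows


# Illustrative events: one successful login per day over three days.
day0 = date(2018, 6, 26)
events = [(day0 + timedelta(days=i), ip)
          for i, ip in enumerate(["203.0.113.1", "203.0.113.2", "203.0.113.3"])]
two_day = pool_by_days(events, 2)
```

With these three days of events, `n=1` gives three windows, `n=2` gives two, and `n=3` gives one; each window is then clustered separately.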

We have made use of three different clustering algorithms to gain better insight into the information in the data. From Figs. 5, 6 and 7, we observe that most of the attacks seem to be concentrated in certain parts of the world. This suggests that the information gained by the attacker spreads in the vicinity of the earliest attack, rather than randomly over the world.
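To make the clustering step concrete, the following is a minimal pure-Python sketch of k-means (Lloyd's algorithm) on latitude and longitude pairs. It is not the implementation used for the figures, for which a library implementation would normally be used. The coordinates are illustrative, the seeding scheme is a simplification chosen for reproducibility, and plain Euclidean distance on degrees is only a rough approximation of distance on the globe.

```python
import math


def kmeans(points, k, iterations=20):
    """Lloyd's k-means on (latitude, longitude) pairs.

    Returns the cluster centers and the groups from the final assignment pass.
    """
    # Deterministic seeding: take k points spread evenly over the input order.
    step = max(1, len(points) // k)
    centers = [points[i * step] for i in range(k)]
    groups = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            groups[nearest].append(p)
        # Update step: move each center to the mean of its group.
        new_centers = []
        for i, group in enumerate(groups):
            if group:
                new_centers.append((sum(lat for lat, _ in group) / len(group),
                                    sum(lon for _, lon in group) / len(group)))
            else:
                new_centers.append(centers[i])  # keep an empty cluster in place
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return centers, groups


# Illustrative coordinates: three points near Beijing, three near Ho Chi Minh City.
points = [(39.9, 116.4), (40.0, 116.3), (39.8, 116.5),
          (10.8, 106.6), (10.9, 106.7), (10.7, 106.8)]
centers, groups = kmeans(points, 2)
```

On this toy input the two centers settle on the two geographic pockets, which is the kind of concentration the figures show at a larger scale.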

Fig. 6. GMM clustering (a) 1 g (b) 2 g (c) 3 g

Fig. 7. K-means clustering (a) 1 g (b) 2 g (c) 3 g

All three techniques give similar results:

  • All techniques give cluster centers which are very close to one another.

  • The cluster centers obtained are similar across 1 g, 2 g and 3 g.

On the other hand, there are some key differences:

  • The mean shift algorithm appears to be more susceptible to outliers, which causes it to detect a greater number of clusters.

  • However, the algorithm behaves better when we increase the number of data points, as in the case of 2 g and 3 g.
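The sensitivity of mean shift to outliers can be illustrated with a small flat-kernel sketch. This is a simplified version written for illustration, not the implementation used for the figures; the bandwidth of 5 degrees, the rule for merging nearby modes, and the coordinates are all assumptions.

```python
import math


def mean_shift(points, bandwidth, iterations=50):
    """Flat-kernel mean shift: move each point to the mean of its neighbours
    until it stops moving, then merge modes that ended up close together."""
    modes = []
    for p in points:
        current = p
        for _ in range(iterations):
            near = [q for q in points if math.dist(current, q) <= bandwidth]
            if not near:
                break  # guard against rounding at the window boundary
            shifted = (sum(lat for lat, _ in near) / len(near),
                       sum(lon for _, lon in near) / len(near))
            if math.dist(shifted, current) < 1e-9:
                break  # reached a mode
            current = shifted
        # Keep the mode only if no earlier mode lies within half a bandwidth.
        if not any(math.dist(current, m) <= bandwidth / 2 for m in modes):
            modes.append(current)
    return modes


# Two illustrative pockets plus one isolated login far from both.
pockets = [(39.9, 116.4), (40.0, 116.3), (39.8, 116.5),
           (10.8, 106.6), (10.9, 106.7), (10.7, 106.8)]
outlier = (-33.9, 151.2)
modes_without = mean_shift(pockets, bandwidth=5.0)
modes_with = mean_shift(pockets + [outlier], bandwidth=5.0)
```

Because the isolated point has no neighbours within the bandwidth, it becomes a mode of its own, so the number of detected clusters grows from two to three. A method with a fixed number of clusters, such as k-means, would instead absorb the point into an existing cluster.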

To better understand why the clustering algorithms have singled out these locations, we have probed the data from 1 g, 2 g and 3 g at specific geographic locations to search for patterns that could help us understand how the attack propagates.

Table 7. One day of interactions for 27th October, 2018 from China | Obtained by 1 g clustering approach

In the 1 g analysis in Table 7, we observe that all the successful attacks appear to have taken place one after another at short intervals. In addition, once an attacker gains access, others in the vicinity seem to gain access shortly afterwards.

In Table 8, we observe the following. The set of IP addresses that make a successful attempt on the first day also appears on the following day. However, there is now a new IP from the same location that is able to successfully gain access to the honeypot. This means that either the attacker has gained access to a new IP, or another attacker in the same geolocation has received the credentials from the first.
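This day-to-day comparison amounts to a set intersection and a set difference over the successful source addresses. The sketch below uses documentation-range IP addresses as stand-ins for the values in the table, and the function name `compare_days` is illustrative.

```python
def compare_days(first_day_ips, second_day_ips):
    """Split the second day's successful IPs into returning and new ones."""
    first, second = set(first_day_ips), set(second_day_ips)
    returning = second & first  # seen on both days
    new = second - first        # first seen on the second day
    return returning, new


# Illustrative addresses only, not taken from the honeypot data.
day_one = ["203.0.113.10", "203.0.113.11"]
day_two = ["203.0.113.10", "203.0.113.11", "203.0.113.42"]
returning, new = compare_days(day_one, day_two)
```

A non-empty `new` set whose addresses geolocate near the returning ones corresponds to the pattern described above.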

Table 8. Two days of interactions for Ho Chi Minh City, Vietnam on 26th June and 27th June, 2018 | Obtained by 2 g clustering approach
Table 9. Three days of interactions for Vietnam from 6th June to 8th June, 2018 | Obtained by 3 g clustering approach

In Table 9, the pattern in the data obtained from the 3 g analysis further strengthens the observations made in the case of 2 g. Here, we can clearly observe that the same set of IP addresses attacks at regular intervals. In addition to those, we see further IP addresses originating from the same or nearby locations, which gives weight to the argument that information about the credentials spreads within the geographical vicinity.

5 Conclusion

We conclude that attacks appear to be concentrated in certain regions. Furthermore, the access credentials do not seem to spread randomly; rather, successful attacks appear to spread in the near vicinity of the first attack.