Data Mining for Security Applications


Conference Paper · December 2008

DOI: 10.1109/EUC.2008.62 · Source: DBLP


All content following this page was uploaded by Mehedy Masud on 04 June 2014.


2008 IEEE/IFIP International Conference on Embedded and Ubiquitous Computing


Bhavani Thuraisingham, Latifur Khan, Mohammad M. Masud, Kevin W. Hamlen

The University of Texas at Dallas
{bhavani.thuraisingham, lkhan, mehedy, hamlen}@utdallas.edu

Abstract

In this paper we discuss various data mining techniques that we have successfully applied for cyber security. These applications include but are not limited to malicious code detection by mining binary executables, network intrusion detection by mining network traffic, anomaly detection, and data stream mining. We summarize our achievements and current work at the University of Texas at Dallas on intrusion detection and cyber-security research.

1. Introduction

Ensuring the integrity of computer networks, both in relation to security and with regard to the institutional life of the nation in general, is a growing concern. Security and defense networks, proprietary research, intellectual property, and data based market mechanisms that depend on unimpeded and undistorted access can all be severely compromised by malicious intrusions. We need to find the best way to protect these systems. In addition we need techniques to detect security breaches.

Data mining has many applications in security, including in national security (e.g., surveillance) as well as in cyber security (e.g., virus detection). The threats to national security include attacking buildings and destroying critical infrastructures such as power grids and telecommunication systems. Data mining techniques are being used to identify suspicious individuals and groups, and to discover which individuals and groups are capable of carrying out terrorist activities. Cyber security is concerned with protecting computer and network systems from corruption due to malicious software including Trojan horses and viruses. Data mining is also being applied to provide solutions such as intrusion detection and auditing. In this paper we will focus mainly on data mining for cyber security applications.

To understand the mechanisms to be applied to safeguard the nation’s computers and networks, we need to understand the types of threats. In [1] we described real-time threats as well as non real-time threats. A real-time threat is a threat that must be acted upon within a limited time to prevent some catastrophic situation. Note that non real-time threats can become real-time threats as new information is uncovered. For example, one could suspect that a group of terrorists will eventually perform some act of terrorism. However, if subsequent intelligence reveals that this act will likely occur before July 1, 2008, then it becomes a real-time threat and we have to take action immediately. If the time bounds are tighter, such as “an attack will occur within two days”, then we cannot afford to make any mistakes in our response.

There has been a lot of work on applying data mining for both national security and cyber security. Much of the focus of our previous paper was on applying data mining for national security [1]. In this part of the paper we will discuss data mining for cyber security. In section 2 we will discuss data mining for cyber security applications. In particular, we will discuss threats to computers and networks and describe applications of data mining to detect such threats and attacks. Some of our current research at the University of Texas at Dallas will be discussed in section 3. The paper is summarized in section 4.

2. Data Mining for Cyber Security

2.1. Overview

This section discusses information related terrorism. By information related terrorism we mean cyber-terrorism as well as security violations through access control and other means. Malicious software such as Trojan horses and viruses are also information related security violations, which we group into information related terrorism activities.

In the next few subsections we discuss various information related terrorist attacks. In section 2.2 we give an overview of cyber-terrorism and then discuss insider threats and external attacks. Malicious intrusions are the subject of section 2.3. Credit card and identity theft are discussed in section 2.4. Attacks on critical infrastructures are discussed in section 2.5. Data mining for cyber security is discussed in section 2.6.

2.2. Cyber-terrorism, Insider Threats, and External Attacks

Cyber-terrorism is one of the major terrorist threats posed to our nation today. As we have mentioned earlier, this threat is exacerbated by the vast quantities of information now available electronically and on the web. Attacks on our computers, networks, databases and the Internet infrastructure could be devastating to businesses. It is estimated that cyber-terrorism could cost businesses billions of dollars. A classic example is that of a banking information system. If terrorists attack such a system and deplete accounts of funds, then the bank could lose millions and perhaps billions of dollars. By crippling the computer system, millions of hours of productivity could be lost, which is ultimately equivalent to direct monetary loss. Even a simple power outage at work through some accident could cause several hours of productivity loss and as a result a major financial loss. Therefore it is critical that our information systems be secure. We discuss various types of cyber-terrorist attacks. One is the propagation of malicious mobile code that can damage or leak sensitive files or other data; another is intrusions upon computer networks.

Threats can occur from outside or from the inside of an organization. Outside attacks are attacks on computers from someone outside the organization. We hear of hackers breaking into computer systems and causing havoc within an organization. Some hackers spread viruses that damage files in various computer systems. But a more sinister problem is that of the insider threat. Insider threats are relatively well understood in the context of non-information related attacks, but information related insider threats are often overlooked or underestimated. People inside an organization who have studied the business’ practices and procedures have an enormous advantage when developing schemes to cripple the organization’s information assets. These people could be regular employees or even those working at computer centers. The problem is quite serious as someone may be masquerading as someone else and causing all kinds of damage. In the next few sections we will examine how data mining can be leveraged to detect and perhaps prevent such attacks.

2.3. Malicious Intrusions

Targets of malicious intrusions include networks, web clients and servers, databases, and operating systems. Many cyber-terrorism attacks are due to malicious intrusions. We hear much about network intrusions. What happens here is that intruders try to tap into the networks and get the information that is being transmitted. These intruders may be human intruders or automated malicious software set up by humans. Intrusions can also target files instead of network communications. For example, an attacker can masquerade as a legitimate user and use their credentials to log in and access restricted files. Intrusions can also occur on databases. In this case the stolen credentials enable the attacker to pose queries such as SQL queries and access restricted data.

Essentially cyber-terrorism includes malicious intrusions as well as sabotage through malicious intrusions or otherwise. Cyber security consists of security mechanisms that attempt to provide solutions to cyber attacks or cyber terrorism. When discussing malicious intrusions or cyber attacks it is often helpful to draw analogies from the non cyber world (that is, non information related terrorism) and then translate those attacks to attacks on computers and networks. For example, a thief could enter a building through a trap door. In the same way, a computer intruder could enter the computer or network through some sort of a trap door that has been intentionally built by a malicious insider or left unattended, perhaps through careless design. Another example is a thief’s use of a stolen uniform to pass as a guard. The analogy here is an intruder masquerading as someone else, legitimately entering the system and taking all the information assets. Money in the real world would translate to information assets in the cyber world. Thus, there are many parallels between non-information related attacks and information related attacks. We can proceed to develop counter-measures for both types of attacks.

2.4. Credit Card Fraud and Identity Theft

We are hearing a lot these days about credit card fraud and identity theft. In the case of credit card fraud, an attacker obtains a person’s credit card and uses it to make unauthorized purchases. By the time the owner of the card becomes aware of the fraud, it may be too late to reverse the damage or apprehend the culprit. A similar problem occurs with telephone calling cards. In fact this type of attack has happened to me personally. Perhaps while I was making phone calls using my calling card at airports someone noticed the dial tones and reproduced them to make free calls. This was my company calling card. Fortunately our telephone company detected the problem and informed my company. The problem was dealt with immediately.

A more serious theft is identity theft. Here one assumes the identity of another person by acquiring key personal information such as a social security number, and uses that information to carry out transactions under the other person’s name. Even a single such transaction, such as selling a house and depositing the income in a fraudulent bank account, can have devastating consequences for the victim. By the time the owner finds out it will be far too late. It is very likely that the owner may have lost millions of dollars due to the identity theft.

We need to explore the use of data mining both for credit card fraud detection as well as for identity theft. There have been some efforts on detecting credit card fraud (see [2]). We need to start working actively on detecting and preventing identity thefts.

2.5. Attacks on Critical Infrastructures

Attacks on critical infrastructures could cripple a nation and its economy. Infrastructure attacks include attacking the telecommunication lines, electric power, gas, reservoirs and water supplies, food supplies and other basic entities that are critical for the operation of a nation.

Attacks on critical infrastructures could occur during any type of attack, whether they are non-information related, information related or bio-terrorism attacks. For example, one could attack the software that runs the telecommunications industry and close down all the telecommunication lines. Similarly, software that runs the power and gas supplies could be attacked. Attacks could also occur through bombs and explosives. That is, the telecommunication lines could be physically attacked. Attacks on transportation lines such as highways and railway tracks are also attacks on infrastructures.

Infrastructures could also be damaged by natural disasters such as hurricanes and earthquakes. Our main interest here is the attacks on infrastructures through malicious attacks, both information related and non-information related. Our goal is to examine data mining and related data management technologies to detect and prevent such infrastructure attacks.

2.6. Data Mining for Cyber Security

Data mining is being applied to problems such as intrusion detection and auditing. For example, anomaly detection techniques could be used to detect unusual patterns and behaviors. Link analysis may be used to trace self-propagating malicious code to its authors. Classification may be used to group various cyber attacks and then use the profiles to detect an attack when it occurs. Prediction may be used to determine potential future attacks depending on information learnt about terrorists through email and phone conversations. Also, for some threats non real-time data mining may suffice while for certain other threats, such as network intrusions, we may need real-time data mining. Many researchers are investigating the use of data mining for intrusion detection. While we need some form of real-time data mining, that is, the results have to be generated in real-time, we also need to build models in real-time. For example, credit card fraud detection is a form of real-time processing. However, here models are usually built ahead of time. Building models in real-time remains a challenge. Data mining can also be used for analyzing web logs as well as analyzing the audit trails. Based on the results of the data mining tool, one can then determine whether any unauthorized intrusions have occurred and/or whether any unauthorized queries have been posed.

Other applications of data mining for cyber security include analyzing the audit data. One could build a repository or a warehouse containing the audit data and then conduct an analysis using various data mining tools to see if there are potential anomalies. For example, there could be a situation where a certain user group accesses the database between 3 and 5 am. It could be that this group is working the night shift, in which case there may be a valid explanation. However, if this group is working between, say, 9 am and 5 pm, then this may be an unusual occurrence. Another example is when a person always accesses the databases between 1 and 2 pm, but for the last 2 days he has been accessing the database between 1 and 2 am. This could then be flagged as an unusual pattern that would need further investigation.

Insider threat analysis is also a problem both from a national security as well as from a cyber security perspective. That is, those working in a corporation who are considered to be trusted could commit espionage. Similarly those with proper access to the computer system could plant Trojan horses and viruses. Catching such terrorists is far more difficult than catching terrorists outside of an organization. One may need to monitor the access patterns of all the individuals of a corporation, even if they are system administrators, to see whether they are carrying out cyber-terrorism activities [3], [4].

While data mining can be used to detect and prevent cyber attacks, data mining also exacerbates some security problems such as inference and privacy. With data mining techniques one could infer sensitive associations from the legitimate responses. For more details on privacy we refer to [5], [6].

3. Our Current Research and Development

3.1. Data Mining for Intrusion and Malicious Code Detection

We are developing a number of tools that use data mining for cyber security applications at the University of Texas at Dallas, including tools for intrusion detection, malicious code detection, and botnet detection. An intrusion can be defined as any set of actions that attempts to compromise the integrity, confidentiality, or availability of a resource. As systems become more complex, there are always exploitable weaknesses due to design and programming errors, or through the use of various “socially engineered” penetration techniques. Computer attacks are split into two categories, host-based attacks and network-based attacks. Host-based attacks target a machine and try to gain access to privileged services or resources on that machine. Host-based detection usually uses routines to obtain system call data from an audit-process which tracks all system calls made by each user-process.

Network-based attacks make it difficult for legitimate users to access various network services by purposely occupying or sabotaging network resources and services. This can be done by sending large amounts of network traffic, exploiting well-known faults in networking services, overloading network hosts, etc. Network-based attack detection uses network traffic data (e.g., tcpdump output) to look at traffic addressed to the machines being monitored. Intrusion detection systems are split into two groups: anomaly detection systems and misuse detection systems.

Anomaly detection is the attempt to identify malicious traffic based on deviations from established normal network traffic patterns. Misuse detection is the ability to identify intrusions based on a known pattern for the malicious activity. These known patterns are referred to as signatures. Anomaly detection is capable of catching new attacks. However, new legitimate behavior can also be falsely identified as an attack, resulting in a false positive. The focus of the current state of the art is to reduce the false negative and false positive rates.

We have used multiple models such as support vector machines (SVM). However we have improved SVM a great deal by combining it with a novel algorithm that we have developed. We will describe this novel algorithm as well as our approach to combining it with SVM. In addition we will also discuss our experimental results. For more details of our research we refer to [7].

Our other tools include those for email worm detection, malicious code detection, buffer overflow detection, botnet detection, and analysis of firewall policy rules. For email worm detection we examine emails and extract features such as “number of attachments” and then train a data mining tool with techniques such as SVM and Naïve Bayesian classifiers to develop a model. Then we test the model to determine whether the email has a virus/worm. We use training and testing data sets posted on various web sites [8]. For firewall policy rule analysis we use association rule mining techniques to determine whether there are any anomalies in the policy rule set [9].

Similarly, for malicious code detection we extract n-gram features from both assembly code and binary code. We train the data mining tool with SVM and then test the model. The classifier then predicts whether the code is malicious. For buffer overflow detection we assume that malicious messages contain code while normal messages contain data. Distinguishing code from data is difficult on many computing architectures such as Windows x86 architectures because of variable-length instruction encodings, mixtures of code and data in each segment of the binary, and encrypted or compressed code segments. While these obstacles have impeded standard disassembly-based static analyses, we have found success using SVM training and testing [10].

3.2. Data Mining for Botnet Detection

Our current research with the University of Illinois Urbana Champaign is focusing on applying data mining techniques for botnet detection. The term “bot” comes from the word robot. A bot is typically autonomous software capable of performing certain functions. A botnet is a network of bots that are used by a human operator or botmaster to carry out malicious actions. Botnets are one of the most powerful tools used in cyber-crime today, being capable of effecting distributed denial-of-service attacks, phishing, spamming, and eavesdropping on remote computers. Often businesses, governments, and individuals are facing million-dollar damages caused by hackers with the help of botnets. It is a major challenge to the cyber-security research community to combat this threat.
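As a concrete illustration of the n-gram feature extraction that Section 3.1 mentions for malicious code detection, the following is a minimal sketch and not the authors' implementation. The function names `byte_ngrams` and `feature_vector`, the choice n = 2, the toy byte string, and the three-entry vocabulary are all assumptions made for illustration; the paper does not specify them.

```python
from collections import Counter
from typing import List, Sequence


def byte_ngrams(data: bytes, n: int = 2) -> Counter:
    """Count every contiguous n-byte sequence in `data` (sliding window, step 1)."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def feature_vector(data: bytes, vocabulary: Sequence[bytes], n: int = 2) -> List[int]:
    """Binary presence vector: 1 if the vocabulary n-gram occurs in `data`, else 0."""
    present = byte_ngrams(data, n)
    return [1 if gram in present else 0 for gram in vocabulary]


# Hypothetical 4-byte sample and hand-picked vocabulary (illustration only).
sample = bytes.fromhex("9090eb05")
vocab = [b"\x90\x90", b"\xeb\x05", b"\xcc\xcc"]

counts = byte_ngrams(sample, 2)            # three distinct 2-grams, each seen once
vector = feature_vector(sample, vocab, 2)  # [1, 1, 0]
```

A vector of this kind is the sort of fixed-length input that could then be passed to a classifier such as an SVM; in practice the vocabulary would be selected from training data rather than written by hand.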

Botnets have different topologies and protocols. The most prevalent botnets use communications based on Internet Relay Chat (IRC), and have a centralized architecture. There are many approaches available to detect and dismantle these IRC botnets. On the other hand, Peer-to-Peer (P2P) networks are a relatively new technology used in botnets. P2P botnets use decentralized P2P protocols to communicate among the bots and the botmaster. These botnets are distributed, having no central point of failure. As a result, these botnets are more difficult to detect and destroy than the IRC botnets. Moreover, most of the current research related to P2P botnets is in the analysis phase. The main goal of our project is to devise an efficient technique to detect P2P botnets. We approach this problem from a data mining perspective. We are developing techniques to mine network traffic for detecting P2P botnet traffic.

Our research on the botnet problem follows from the important observation that network traffic (as well as botnet traffic) is a continuous data stream. Conventional data mining techniques are not directly applicable to stream data because of concept drift and infinite length. We propose a technique that can efficiently handle both problems. Our main focus is to adapt three major data mining techniques (classification, clustering, and outlier detection) to handle stream data. Our preliminary study on the development of new stream classification techniques for P2P botnet detection has encouraging results [11].

4. Summary and Directions

This paper has discussed data mining for security applications. We first started with a discussion of data mining for cyber security applications and then provided a brief overview of the tools we are developing. Data mining for national security as well as for cyber security is a very active research area. Various data mining techniques including link analysis and association rule mining are being explored to detect abnormal patterns. Because of data mining, users can now make all kinds of correlations. This also raises privacy concerns.

One of the areas we are exploring for future research is active defense. Here we are investigating ways to monitor the adversaries. For such monitoring to be effective, the monitor must avoid detection by the static and dynamic analyses employed by standard anti-malware packages. We are therefore developing techniques that can dynamically adapt to new detection strategies and continue to monitor the adversary. We are exploring the use of adaptive machine learning techniques for this purpose. In addition, we are enhancing the techniques we have developed to reduce false positives and false negatives. Furthermore, we are exploring the applicability of our techniques to distributed and pervasive environments.

5. References

[1] Thuraisingham, B., “Web Data Mining Technologies and Their Applications in Business Intelligence and Counter-terrorism”, CRC Press, FL, 2003.

[2] Chan, P., et al., “Distributed Data Mining in Credit Card Fraud Detection”, IEEE Intelligent Systems, 14 (6), 1999.

[3] Lazarevic, A., et al., “Data Mining for Computer Security Applications”, Tutorial Proc. IEEE Data Mining Conference, 2003.

[4] Thuraisingham, B., “Managing Threats to Web Databases and Cyber Systems, Issues, Solutions and Challenges”, Kluwer, MA 2004 (Editors: V. Kumar et al).

[5] Thuraisingham, B., “Database and Applications Security”, CRC Press, 2005.

[6] Thuraisingham, B., “Data Mining, Privacy, Civil Liberties and National Security”, SIGKDD Explorations, 2002.

[7] Khan, L., Awad, M. and Thuraisingham, B., “A New Intrusion Detection System using Support Vector Machines and Hierarchical Clustering”, The VLDB Journal: ACM/Springer-Verlag, 16(1), page 507-521, 2007.

[8] Masud, M. M., Khan, L. and Thuraisingham, B., “Feature based Techniques for Auto-detection of Novel Email Worms”, In Proc. 11th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2007), Nanjing, China, May 2007, page 205-216.

[9] Abedin, M., Nessa, S., Khan, L., Thuraisingham, B., “Detection and Resolution of Anomalies in Firewall Policy Rules”, In Proc. 20th IFIP WG 11.3 Working Conference on Data and Applications Security (DBSec 2006), Springer-Verlag, July 2006, Sophia Antipolis, France, page 15-29.

[10] Masud, M. M., Khan, L., Thuraisingham, B., Wang, X., Liu, P., and Zhu, S., “A Data Mining Technique to Detect Remote Exploits”, In Proc. IFIP WG 11.9 International Conference on Digital Forensics, Japan, Jan 27-30, 2008.

[11] Masud, M. M., Gao, J., Khan, L., Han, J., Thuraisingham, B., “Peer to Peer Botnet Detection for Cyber-Security: A Data Mining Approach”, In Proc. Cyber Security and Information Intelligence Research Workshop (CSIIRW 08), Oak Ridge National Laboratory, Oak Ridge, TN, May 12-14, 2008.


Authorized licensed use limited to: Univ of Texas at Dallas. Downloaded on May 05,2010 at 19:59:24 UTC from IEEE Xplore. Restrictions apply.