
2013 International Conference on Control Communication and Computing (ICCC)

Detection of Phishing URLs Using Machine Learning Techniques

Joby James, SCT College of Engineering, Trivandrum. jamesjoby@gmail.com
Sandhya L., SCT College of Engineering, Trivandrum. lsandyaajith@gmail.com
Ciza Thomas, College of Engineering, Trivandrum. cizathomas@gmail.com

Abstract—Phishing costs Internet users billions of dollars per year. It refers to luring techniques used by identity thieves to fish for personal information in a pond of unsuspecting Internet users. Phishers use spoofed e-mail and phishing software to steal personal information and financial account details such as usernames and passwords. This paper deals with methods for detecting phishing websites by analyzing various features of benign and phishing URLs using machine learning techniques. We discuss the methods used for detection of phishing websites based on lexical features, host properties and page importance properties. We consider various data mining algorithms for evaluation of the features in order to get a better understanding of the structure of URLs that spread phishing. The fine-tuned parameters are useful in selecting the apt machine learning algorithm for separating phishing sites from benign sites.

Keywords—Phishing; benign; URL; PageRank; WHOIS

I. INTRODUCTION
Phishing is a criminal mechanism employing both social engineering and technical tricks to steal consumers' personal identity data and financial account credentials. Social engineering schemes use spoofed e-mails, purporting to be from legitimate businesses and agencies, designed to lead consumers to counterfeit websites that trick recipients into divulging financial data such as usernames and passwords. Technical subterfuge schemes install malicious software onto computers to steal credentials directly, often using systems that intercept consumers' online account user names and passwords [1].

Figure 1 represents the webpage of the popular website www.facebook.com. Figure 2 represents a webpage similar to that of facebook, but it is the webpage of a site which spreads phishing. A user may mistake the second site for the genuine facebook site and provide his personal identity details. The phisher can thus steal that information and may use it for vicious purposes.

Figure 1. Original facebook webpage

Figure 2. Phishing webpage [4]

A. The Technique of Phishing

The criminals who want to obtain sensitive data first create unauthorized replicas of a real website and e-mail, usually from a financial institution or another company that deals with financial information. The e-mail will be created using the logos and slogans of a legitimate company. The nature and format of Hypertext Mark-up Language makes it very easy to copy images or even an entire website. While this ease of website creation is one of the reasons the Internet has grown so rapidly as a communication medium, it also permits the abuse of trademarks, trade names, and other corporate identifiers upon which consumers have come to rely as mechanisms for authentication. Phishers then send the "spoofed" e-mails to as many people as possible in an attempt to lure them into the scheme. When these e-mails are opened, or when a link in the mail is clicked, the consumers are redirected to a spoofed website, appearing to be from the legitimate entity.

B. Statistics of Phishing attacks

Phishing continues to be one of the most rapidly growing classes of identity theft scams on the Internet, causing both short term and long term economic damage. There were nearly 33,000 phishing attacks globally each month in the year 2012, totalling a loss of $687 million [1].

An example of phishing occurred in June 2004. The Royal Bank of Canada notified customers that fraudulent e-mails purporting to originate from the Royal Bank were being sent out asking customers to verify account numbers and personal identification numbers (PINs) through a link included in the e-mail. The fraudulent e-mail stated that if the receiver did not click on the link and key in his client card number and pass code, access to his account would be blocked. These e-mails were sent within a week of a computer malfunction that prevented customer accounts from being updated [2].

The United States continued to be the top country hosting phishing sites during the third quarter of 2012. This is mainly due to the fact that a large percentage of the world's Web sites and domain names are hosted in the United States. Financial Services remains the most targeted industry sector by phishers [1].
II. RELATED WORK
Many researchers have analyzed the statistics of suspicious URLs in some way. Our approach borrows important ideas from previous studies. We review the previous work on phishing site detection using URL features that motivated our own approach.

Ma et al. [3, 4] compared several batch-based learning algorithms for classifying phishing URLs and showed that the combination of host-based and lexical features results in the highest classification accuracy. They also compared the performance of batch-based algorithms to online algorithms when using the full feature set and found that online algorithms, especially Confidence-Weighted (CW), outperform batch-based algorithms.

The work by Garera et al. [5] uses logistic regression over hand-selected features to classify phishing URLs. The features include the presence of red flag keywords in the URL, features based on Google's PageRank, and Google's Web page quality guidelines. It is difficult to make a direct comparison with our approach without access to the same URLs and features.

McGrath and Gupta [6] did not construct a classifier, but performed a comparative analysis of phishing and non-phishing URLs with respect to their datasets. They compared non-phishing URLs drawn from the DMOZ Open Directory Project [7] to phishing URLs from PhishTank [8]. The features they analyze include IP addresses, WHOIS thin records containing date and registrar-provided information, geographic information, and lexical features of the URL such as length, character distribution, and presence of predefined brand names [6].
III. PROBLEM OVERVIEW Figure4. Block diagram for host based analysis
URLs, sometimes known as "Web links", are the primary means by which users locate information on the Internet. Our aim is to derive classification models that detect phishing web sites by analysis of the lexical and host-based features of URLs. We analyze different classifying algorithms in the Waikato Environment for Knowledge Analysis (WEKA) workbench and in MATLAB.
IV. DESIGN FLOW
The work consists of host based, page based and lexical feature extraction of the collected URLs, followed by analysis. The first step is the collection of phishing and benign URLs. The host based, popularity based and lexical feature extractions are applied to form a database of feature values. The database is knowledge-mined using different machine learning methods. After evaluating the classifiers, a particular classifier is selected and implemented in MATLAB. The design flow is shown in Figure 3.

Figure 3. Design flow graph (collect phishing and benign URLs; host based, page based and lexical feature extraction; evaluate features using machine learning methods; selection of a classifier; implement the classifier)

A. Collection of URLs

We collected URLs of benign websites from www.alexa.com [9], www.dmoz.org [7] and personal web browser history. The phishing URLs were collected from www.phishtank.com [8]. The data set consists of 17,000 phishing URLs and 20,000 benign URLs.

We obtained the PageRank [10] of 240 benign websites and 240 phishing websites by checking the PageRank individually at PR Checker [11]. We also collected WHOIS [12] information of 240 benign websites and 240 phishing websites.

B. Host based analysis

Host-based features describe "where" phishing sites are hosted, "who" manages them, and "how" they are administered. We use these features because phishing Web sites may be hosted in less reputable hosting centers, on machines that are not usual Web hosts, or through not so reputable registrars. The block schematic for the host based analysis is shown in Figure 4.

Figure 4. Block diagram for host based analysis (collect dataset; do 'WHOIS' query of each URL; save the 'WHOIS' data in .txt format; analyse the 'WHOIS' features)

The following are the properties of the hosts that are identified.

1) WHOIS properties: WHOIS [12] properties give details about the dates of registration, update and expiry, and about who the registrar and the registrant are. If phishing sites are taken down frequently, their registration dates will be newer than those of legitimate sites. A large number of phishing websites contain an IP address in their hostname [5], so getting the details of such hostnames from the WHOIS properties is helpful in efforts to point to phishing sites. A sketch of such a registration-date check follows.
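The registration-date check described above can be sketched in a few lines of Python. This is our illustration, not the paper's code: the paper performs its WHOIS queries from MATLAB and saves the results to .txt files. The sketch assumes the third-party python-whois package, and the queried domain is a placeholder.

```python
# A minimal sketch of the WHOIS registration-date check, assuming the
# third-party `python-whois` package (pip install python-whois).
from datetime import datetime

import whois


def domain_age_days(domain: str):
    """Return the domain's age in days, or None if WHOIS gives no date."""
    record = whois.whois(domain)
    created = record.creation_date
    # python-whois may return a list when the registrar reports several dates.
    if isinstance(created, list):
        created = min(created)
    if created is None:
        return None
    return (datetime.now() - created).days


if __name__ == "__main__":
    # Recently registered domains are treated as more suspicious.
    age = domain_age_days("example.com")  # placeholder domain
    print("registered %s days ago" % age if age is not None else "no WHOIS date")
```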
2) Geographic properties: Geographic properties give details about the continent/country/city to which the IP address belongs.
3) Blacklist membership: A large percentage of phishing URLs are present in blacklists. In the Web browsing context, blacklists are precompiled lists or databases that contain IP addresses, domain names or URLs of malicious sites that web users should avoid. White lists, on the other hand, contain sites that are known to be safe.

a) DNS-Based Blacklists: Users submit a query representing the IP address or the domain name in question to the blacklist provider's special DNS server, and the response is an IP address that represents whether the query was present in the blacklist (a sketch of such a lookup is given at the end of this subsection). SORBS [13], URIBL [14], SURBL [15] and Spamhaus [16] are examples of major DNS blacklist providers.

b) Browser Toolbars: Browser toolbars provide a client-side defense for users. Before a user visits a site, the toolbar intercepts the URL from the address bar and cross-references a URL blacklist, which is often stored locally on the user's machine or on a server that the browser can query. If the URL is present on the blacklist, the browser redirects the user to a special warning screen that provides information about the threat. McAfee SiteAdvisor [17], Google Toolbar [18] and WOT Web of Trust [19] are prominent examples of blacklist-backed browser toolbars.

c) Network Appliances: Dedicated network hardware is another popular option for deploying blacklists. These appliances serve as proxies between user machines within an enterprise network and the rest of the Internet. As users within an organization visit sites, the appliance intercepts outgoing connections and cross-references URLs or IP addresses against a precompiled blacklist. IronPort, acquired by Cisco in 2007, and WebSense are examples of companies that produce blacklist-backed network appliances.

Limitations of blacklists: The primary advantage of blacklists is that querying is a low overhead operation: the lists of malicious sites are precompiled, so the only computational cost of deployed blacklists is the lookup overhead. However, the need to construct these lists in advance gives rise to their disadvantage that blacklists become stale. Network administrators block existing malicious sites, and enforcement efforts take down the criminal enterprises behind those sites. There is constant pressure on criminals to construct new sites and to find new hosting infrastructure. As a result, new malicious URLs are introduced and blacklist providers must update their lists yet again. In this process, criminals are always ahead because Web site construction is inexpensive. Moreover, free services for blogs, e.g., Blogger [20], and personal hosting, e.g., Google Sites [21] and Microsoft Live Spaces [22], provide another inexpensive source of disposable sites.
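As an illustration of the DNS-based lookup described in a) above, the following Python sketch reverses the IP's octets, appends the provider's zone and attempts an A-record lookup; an answer means the address is listed, while NXDOMAIN means it is not. The zone name is Spamhaus's public zen zone [16] (our choice for the example, subject to the provider's usage policy); the paper itself gives no such code.

```python
# A minimal sketch of a DNS blacklist (DNSBL) membership query.
import socket


def is_blacklisted(ip: str, zone: str = "zen.spamhaus.org") -> bool:
    # 1.2.3.4 is queried as 4.3.2.1.zen.spamhaus.org
    query = ".".join(reversed(ip.split("."))) + "." + zone
    try:
        socket.gethostbyname(query)  # resolves to a 127.0.0.x code when listed
        return True
    except socket.gaierror:          # NXDOMAIN: the address is not on the list
        return False


if __name__ == "__main__":
    print(is_blacklisted("127.0.0.2"))  # conventional DNSBL test entry
```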
4) Page/Popularity Based Property: Popularity features indicate how popular a web page is among Internet users. The popularity features used are as follows:

a) PageRank [10]: PageRank is one of the methods Google uses to determine a page's relevance or importance. The maximum PR of all pages on the web changes every month when Google does its re-indexing. The PageRanks form a probability distribution over web pages, so the sum of all web pages' PageRanks is equal to unity; a toy illustration of this property follows this item.

b) Traffic Rank details: The Traffic Rank of a website indicates the site's popularity. Alexa.com ranks websites according to Internet traffic over the previous 3 months. Ranks close to 1 are accurate; ranks above 100,000 are not so accurate, since the chance of error is high.
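The probability-distribution property noted under a) can be demonstrated with a toy power-iteration sketch. This is our illustration only; the paper reads PageRank values from PR Checker [11] rather than computing them, and this is not Google's implementation.

```python
# Toy PageRank by power iteration: the ranks converge to a probability
# distribution over pages, so they always sum to one.
def pagerank(links, damping=0.85, iters=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1.0 - damping) / n for p in pages}
        for p, outs in links.items():
            share = rank[p] / (len(outs) if outs else n)
            targets = outs if outs else pages  # dangling page: spread evenly
            for q in targets:
                new[q] += damping * share
        rank = new
    return rank


if __name__ == "__main__":
    graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    ranks = pagerank(graph)
    print(ranks, "sum =", round(sum(ranks.values()), 6))  # sum == 1.0
```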
5) Lexical feature analysis: Lexical features are the textual properties of the URL itself, not of the content of the page it points to. URLs are human-readable text strings that are parsed in a standard way by client programs. Through a multistep resolution process, browsers translate each URL into instructions that locate the server hosting the site and specify where the site or resource is placed on that host. To facilitate this machine translation process, URLs have the following standard syntax:

<protocol>://<hostname><path>

As an example of URL resolution, in the URL https://accounts.google.com/ServiceLogin?service=mail&passive=true&rm=false&continue=https://mail.google.com/mail/&ss=1&scc=1&ltmpl=default&ltmplcache=2 the protocol is https, the hostname is accounts.google.com (with .com as the top level domain), and the remainder is the path.

The <protocol> portion of the URL indicates which network protocol should be used to fetch the requested resource. The most common protocols in use are Hypertext Transport Protocol or HTTP (http), HTTP with Transport Layer Security (https), and File Transfer Protocol (ftp).

The <hostname> is the identifier for the Web server on the Internet. Sometimes it is a machine-readable Internet Protocol (IP) address, but more often, especially from the user's perspective, it is a human-readable domain name.

The <path> of a URL is analogous to the path name of a file on a local computer. The path tokens, delimited by punctuation marks such as slashes, dots, and dashes, show how the site is organized. Criminals sometimes obscure path tokens to avoid scrutiny, or they may deliberately construct tokens to mimic the appearance of a legitimate site.
The methodology used in our work to extract the lexical features from the URL list is as follows. The URLs of legitimate websites, collected from alexa.com and dmoz.org, are written into a text file which is saved on the computer. The MATLAB program is then executed; it asks for an input file, and the benign URL list is fed to it. The program processes the list and the feature list is obtained. The decision vector '0' is appended, and the list is saved in Excel and CSV format at the location on the computer specified in the program. The same procedure is done for the phishing URL list, with decision vector '1' appended. The feature set comprises host length, path length, number of slashes, number of path tokens, etc. Figure 5 shows the flowchart of feature extraction; a Python sketch of an equivalent extraction follows.

Figure 5. Flow chart for feature extraction
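The sketch below mirrors the MATLAB extraction described above. The exact feature set and file names are assumptions on our part; the paper names host length, path length, number of slashes and number of path tokens among its features, with decision vector 0 for benign and 1 for phishing URLs.

```python
# Hedged sketch of lexical feature extraction from a URL list to CSV.
import csv
import re
from urllib.parse import urlparse


def lexical_features(url: str) -> list:
    parsed = urlparse(url if "://" in url else "http://" + url)
    host, path = parsed.netloc, parsed.path
    return [
        len(host),                                          # host length
        len(path),                                          # path length
        url.count("/"),                                     # number of slashes
        len([t for t in re.split(r"[/._-]", path) if t]),   # number of path tokens
        1 if re.fullmatch(r"[\d.]+", host) else 0,          # IP used as hostname
    ]


def write_feature_csv(urls, label, out_file):
    with open(out_file, "w", newline="") as f:
        writer = csv.writer(f)
        for u in urls:
            writer.writerow(lexical_features(u) + [label])  # decision vector


# Hypothetical input/output file names:
# write_feature_csv(open("benign.txt").read().split(), 0, "benign.csv")
# write_feature_csv(open("phish.txt").read().split(), 1, "phish.csv")
```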
C. Machine learning algorithms

The evaluation of the various classifying algorithms is done using the workbench for data mining, the Waikato Environment for Knowledge Analysis (WEKA) [23], and using MATLAB.

Four types of input data file, i.e., Attribute Relation File Format (.arff), Comma Separated Values (.csv), C4.5 and binary, are allowed in WEKA. In our experiment the .csv file format was used. The input file to WEKA was obtained by a MATLAB program that appends 'YES' in place of decision vector '1' (phish) and 'NO' in place of decision vector '0' (benign) in the dataset generated by MATLAB from the input URL list. The evaluation was done using a percentage split of 60%.

The input to the classifiers in MATLAB is four files: test.xls, testresult.xls, train.xls and trainresult.xls.

The four machine learning algorithms considered for processing the feature set are:

1) Naive Bayes: a simple probabilistic classifier based on applying Bayes' theorem with strong independence (naive) assumptions. Parameter estimation for Naive Bayes models uses maximum likelihood estimation. It takes only one pass over the training set and is computationally very fast.

2) J48 decision tree: a predictive machine-learning model that decides the target value (dependent variable) of a new sample based on various attribute values of the available data.

3) k-NN: classifies based on the closest training examples in the feature space. An object is classified by a majority vote of its neighbors.

4) SVM: performs classification by finding the hyperplane that maximizes the margin between the two classes. The vectors that define the hyperplane are the support vectors.

The program flow for the classifier performance evaluation is shown in Figure 6.

Figure 6. Program flow (load the Excel datasheet when prompted; generate the train.xls, trainresult.xls, test.xls and testresult.xls files; run the Naive Bayes, regression, SVM and k-NN classifiers; analyse performance; choose the suitable classifier; then train it and classify an input URL as phish or benign)
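For readers without WEKA or MATLAB, the evaluation can be approximated in scikit-learn (our analogue, not the paper's toolchain): the same four classifier families are trained on the extracted feature file with a 60% test split, i.e. 40% of the data for training, matching the percentage-split setting described above. The file name features.csv is hypothetical.

```python
# Hedged scikit-learn analogue of the four-classifier evaluation.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier  # stands in for WEKA's J48

data = np.loadtxt("features.csv", delimiter=",")  # hypothetical feature file
X, y = data[:, :-1], data[:, -1]                  # last column = decision vector
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.60, random_state=0)

classifiers = {
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(),
    "k-NN": KNeighborsClassifier(),
    "SVM": SVC(),
}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    print(name, "success rate: %.2f%%" % (100 * clf.score(X_test, y_test)))
```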
V. RESULTS

The main findings of our preliminary work include:

• Phishing URLs and domains exhibit characteristics that are different from those of other URLs and domains.

• Phishing URLs and domain names have very different lengths compared to other URLs and domain names on the Internet.

• Many of the phishing URLs contained the name of the brand they targeted.
The PageRank of benign and phishing websites was collected using the Google PageRank Checker [11] and is presented in Figure 7 and Figure 8. The PageRank values obtained for phishing sites are Not Available, Non-Existing and 0.

The N/A PageRank (grey bar in Figure 7) might be due to one of the following reasons [11]:

• The web page is new, and it is not indexed by Google yet.
• The web page is indexed by Google, but it is not ranked yet.
• The web page was indexed by Google long ago, but it is recognized as a supplemental page.
• The web page or the whole website is banned by Google.

A Supplemental Result is a URL residing in Google's secondary database containing pages of less importance, as measured primarily by Google's PageRank algorithm. Google used to place a "Supplemental Result" label at the bottom of a search result to indicate that it was in the supplemental index; however, in July 2007 this practice was discontinued, and it is no longer possible to tell whether a result is in the supplemental index or the main one [11].

The PageRank for benign sites ranges from 0 to 9. We used 240 benign URLs and 240 malicious URLs for the plots. It is inferred from the graphs that the PageRank is considerably higher for benign URLs than for phishing websites. One exception is newly registered websites, for which the PageRank check also returns an 'N/A' (Not Available) message from the PageRank Checker [11].

Figure 7. Number of phishing sites vs. PageRank

Figure 8. Number of benign sites vs. PageRank

We analyzed the prepared URL feature dataset using the Naïve Bayes, J48 Decision Tree, k-NN and SVM classifying algorithms in WEKA. The percentage split is set to 60%, i.e., 40 percent of the dataset is taken as training data and 60 percent as test data. The performance is then evaluated based on the confusion matrix, Detection Accuracy, True Positive Rate and False Positive Rate. The results are tabulated in TABLE 1; a sketch of how these metrics follow from a confusion matrix is given after the table.

The analysis of the dataset is also done using MATLAB under the same testing conditions, and is tabulated in TABLE 2.

When we compare the Success Rates in the WEKA and MATLAB analyses, slight differences in the values are seen. The J48 Decision Tree has the highest Success Rate among the selected classifying algorithms in WEKA. Using only the lexical features, we were able to achieve a Detection Accuracy/Success Rate of 93.2% for a test split of 60%. When 90% of the dataset is used for testing, we got 93.78% Detection Accuracy. In MATLAB, using the Regression Tree, we got 91.08% detection accuracy when using 60% of the dataset for testing and 85.63% detection accuracy when using 90% of the data for testing.

TABLE 1. Classifier Performance - WEKA

Test options        | Classifier  | Confusion matrix      | Success Rate (%) | Error Rate (%)
Percentage split-60 | Naïve Bayes | [4438 3578; 260 3945] | 68.60            | 31.40
Percentage split-60 | J48         | [7612 404; 428 3777]  | 93.20            | 6.80
Percentage split-60 | IBK         | [7042 974; 455 3750]  | 88.30            | 11.70
Percentage split-60 | SVM         | [7511 505; 1459 2746] | 83.93            | 16.07
Percentage split-90 | Naïve Bayes | [1180 792; 61 1022]   | 72.08            | 27.92
Percentage split-90 | J48         | [1883 89; 101 982]    | 93.78            | 6.22
Percentage split-90 | IBK         | [1756 216; 97 986]    | 89.75            | 10.25
Percentage split-90 | SVM         | [1846 126; 355 728]   | 84.26            | 15.74
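As referenced above, the reported metrics follow directly from a confusion matrix. The sketch below, fed the J48 percentage split-60 matrix from TABLE 1 (read as [[TN FP],[FN TP]], which is an assumption about the table's layout), reproduces the 93.2% success rate.

```python
# Detection metrics from a confusion matrix [[TN, FP], [FN, TP]].
def metrics(tn, fp, fn, tp):
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,  # detection accuracy / success rate
        "tp_rate": tp / (tp + fn),      # true positive rate (recall)
        "fp_rate": fp / (fp + tn),      # false positive rate
        "error": (fp + fn) / total,     # error rate
    }


print(metrics(7612, 404, 428, 3777))  # accuracy comes out to approx. 0.932
```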

Figure 9 shows a comparison of the TP Rate, FP Rate and Detection Accuracy of the SVM, Naïve Bayes, Regression Tree and k-NN classifiers. Figure 10 shows the detection accuracy parameters of the classifiers with 60% and 90% test splits.
TABLE 2. Classifier performance - MATLAB

Test options        | Classifier      | Confusion matrix        | Success Rate (%) | Error Rate (%)
Percentage split-60 | Naïve Bayes     | [7281 303; 3633 4042]   | 74.20            | 25.80
Percentage split-60 | Regression Tree | [10856 470; 1166 5839]  | 91.08            | 8.92
Percentage split-60 | KNN             | [11299 3025; 723 3284]  | 79.55            | 20.45
Percentage split-60 | SVM             | [9871 806; 1082 3531]   | 87.65            | 12.35
Percentage split-90 | Naïve Bayes     | [13648 1018; 2764 5500] | 83.50            | 16.50
Percentage split-90 | Regression Tree | [15082 999; 2951 8465]  | 85.63            | 14.37
Percentage split-90 | KNN             | [16451 5080; 1582 4384] | 75.77            | 24.23
Percentage split-90 | SVM             | [16416 5848; 5 661]     | 74.48            | 25.52

Figure 9. Detection parameters (TP Rate, FP Rate and Accuracy, in %)

Figure 10. Detection accuracy comparison (60% and 90% test splits)

Apart from that, another experiment was done to test whether an input URL is phish or not. The URL is loaded into the MATLAB program and its URL features are extracted. A feature set is created in .xls format and used as test data, and the classifier makes the decision whether the URL is 'Benign' or 'Phish' with its specified accuracy. A sketch of this experiment follows.
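A sketch of this single-URL experiment is given below, with scikit-learn again standing in for the paper's MATLAB classifier. It reuses the hypothetical lexical_features() helper and features.csv file from the earlier sketches; none of these names come from the paper, and the candidate URL is a made-up example.

```python
# Hedged sketch: classify one new URL with a trained model.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

data = np.loadtxt("features.csv", delimiter=",")   # hypothetical feature file
clf = DecisionTreeClassifier().fit(data[:, :-1], data[:, -1])

candidate = "http://paypal.com.account-update.example.net/login"
label = clf.predict([lexical_features(candidate)])[0]  # helper from earlier sketch
print("Phish" if label == 1 else "Benign")
```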
VI. CONCLUSION

Several features are compared using various data mining algorithms. The results point to the efficiency that can be achieved using the lexical features. To protect end users from visiting these sites, we can try to identify phishing URLs by analyzing their lexical and host-based features. A particular challenge in this domain is that criminals constantly devise new strategies to counter our defense measures. To succeed in this contest, we need algorithms that continually adapt to new examples and features of phishing URLs.

Online learning algorithms provide better learning methods compared to batch-based learning mechanisms. Going forward, we are interested in various aspects of online learning and in collecting data to understand new trends in phishing activities, such as fast changing DNS servers.

REFERENCES

[1] Phishing Trends Report for Q3 2012, Anti-Phishing Working Group, http://antiphishing.org.
[2] Report on Phishing, Binational Working Group on Cross-Border Mass Marketing Fraud, October 2006.
[3] J. Ma, L. K. Saul, S. Savage, and G. M. Voelker, "Beyond Blacklists: Learning to Detect Malicious Web Sites from Suspicious URLs", Proc. of SIGKDD '09.
[4] J. Ma, L. K. Saul, S. Savage, and G. M. Voelker, "Learning to Detect Malicious URLs", ACM Transactions on Intelligent Systems and Technology, Vol. 2, No. 3, Article 30, April 2011.
[5] S. Garera, N. Provos, M. Chew, and A. D. Rubin, "A Framework for Detection and Measurement of Phishing Attacks", in Proceedings of the ACM Workshop on Rapid Malcode (WORM), Alexandria, VA.
[6] D. K. McGrath and M. Gupta, "Behind Phishing: An Examination of Phisher Modi Operandi", in Proceedings of the USENIX Workshop on Large-Scale Exploits and Emergent Threats (LEET).
[7] DMOZ Open Directory Project, http://www.dmoz.org.
[8] PhishTank, http://www.phishtank.com.
[9] The Web Information Company, www.alexa.com.
[10] I. Rogers, "Google Page Rank - Whitepaper", http://www.sirgroane.net/google-page-rank/.
[11] PR Checker, http://www.prchecker.info/check_page_rank.php.
[12] WHOIS look up, www.whois.net, www.whois.com.
[13] SORBS, Spam and Open-Relay Blocking System, www.sorbs.net.
[14] URIBL, URI Blacklist, www.uribl.com.
[15] SURBL, www.surbl.org.
[16] Spamhaus, www.spamhaus.org.
[17] McAfee SiteAdvisor, www.siteadvisor.com.
[18] Google Toolbar, www.toolbar.google.com.
[19] WOT Web of Trust, http://www.mywot.com.
[20] Blogger, www.blogger.com.
[21] Google Sites, www.sites.google.com.
[22] Microsoft sites, www.microsoft.com/en/in/sitemap.aspx.
[23] Data Mining with Open Source Machine Learning Software, www.cs.waikato.ac.nz/ml/weka/.
[24] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, Elsevier Inc., 2006.
