Detection of Phishing URLs Using Machine Learning

Abstract— Phishing costs Internet users billions of dollars per year. It refers to luring techniques used by identity thieves to fish for personal information in a pond of unsuspecting Internet users. Phishers use spoofed e-mail and phishing software to steal personal information and financial account details such as usernames and passwords. This paper deals with methods for detecting phishing websites by analyzing various features of benign and phishing URLs with machine learning techniques. We discuss the methods used for detection of phishing websites based on lexical features, host properties and page importance properties. We consider various data mining algorithms for evaluation of the features in order to get a better understanding of the structure of URLs that spread phishing. The fine-tuned parameters are useful in selecting the apt machine learning algorithm for separating phishing sites from benign sites.

I. INTRODUCTION

Phishing is a criminal mechanism employing both social engineering and technical tricks to steal consumers' personal identity data and financial account credentials. Social engineering schemes use spoofed e-mails, purporting to be from legitimate businesses and agencies, designed to lead consumers to counterfeit websites that trick recipients into divulging financial data such as usernames and passwords. Technical subterfuge schemes install malicious software onto computers to steal credentials directly, often using systems to intercept consumers' online account user names and passwords [1].

Figure 1 represents the webpage of the popular website www.facebook.com. Figure 2 represents a webpage similar to that of facebook, but it is the webpage of a site which spreads phishing activities. A user may mistake the second site for the genuine facebook site and provide his personal identity details. The phisher can thus steal that information and may use it for vicious purposes.

Figure 1. Original facebook webpage

Figure 2. Phishing webpage [4]

A. The Technique of Phishing

The criminals, who want to obtain sensitive data, first create unauthorized replicas of a real website and e-mail, usually from a financial institution or another company that deals with financial information. The e-mail will be created using the logos and slogans of a legitimate company. The nature and format of Hypertext Mark-up Language makes it very easy to copy images or even an entire website. While this ease of website creation is one of the reasons that the Internet has grown so rapidly as a communication medium, it also permits the abuse of trademarks, trade names, and other corporate identifiers upon which consumers have come to rely as mechanisms for authentication. Phishers then send the "spoofed" e-mails to as many people as possible in an attempt to lure them into the scheme. When these e-mails are opened or when a link in the mail is clicked, the consumers are redirected to a spoofed website, appearing to be from the legitimate entity.

B. Statistics of Phishing attacks

Phishing continues to be one of the most rapidly growing classes of identity theft scams on the Internet, causing both short term and long term economic damage. There were nearly 33,000 phishing attacks globally each month in the year 2012, totalling a loss of $687 million [1].

An example of phishing occurred in June 2004. The Royal Bank of Canada notified customers that fraudulent e-mails purporting to originate from the Royal Bank were being sent out asking customers to verify account numbers and personal identification numbers (PINs) through a link included in the e-mail.
3) Blacklist membership: A large percentage of phishing URLs were present in blacklists. In the Web browsing context, blacklists are precompiled lists or databases that contain IP addresses, domain names or URLs of malicious sites that web users should avoid. On the other hand, white lists contain sites that are known to be safe.

a) DNS-Based Blacklists: Users submit a query representing the IP address or the domain name in question to the blacklist provider's special DNS server, and the response is an IP address that represents whether the query was present in the blacklist. SORBS [13], URIBL [14], SURBL [15] and Spamhaus [16] are examples of major DNS blacklist providers.
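The query convention most DNS blacklist providers follow is to reverse the octets of the IP address, prepend them to the provider's zone, and resolve the resulting name; a normal DNS answer means the address is listed. The minimal Python sketch below illustrates that convention only; the zone name zen.spamhaus.org and the 127.0.0.2 test address are illustrative assumptions, not details taken from this paper.

import socket

def dnsbl_lookup(ip, zone="zen.spamhaus.org"):
    # Reverse the IPv4 octets and prepend them to the provider's zone,
    # e.g. 203.0.113.7 -> 7.113.0.203.zen.spamhaus.org
    query = ".".join(reversed(ip.split("."))) + "." + zone
    try:
        # A successful lookup returns a 127.0.0.x code, meaning "listed".
        return socket.gethostbyname(query)
    except socket.gaierror:
        # NXDOMAIN (or resolver error): the address is not on this list.
        return None

print(dnsbl_lookup("127.0.0.2"))  # 127.0.0.2 is the conventional DNSBL test entry

A listed address resolves to a 127.0.0.x return code whose meaning is provider-specific; a failed lookup simply means the address is not on that list.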
b) Browser Toolbars: Browser toolbars provide a client-side defense for users. Before a user visits a site, the toolbar intercepts the URL from the address bar and cross-references a URL blacklist, which is often stored locally on the user's machine or on a server that the browser can query. If the URL is present on the blacklist, the browser redirects the user to a special warning screen that provides information about the threat. McAfee SiteAdvisor [17], Google Toolbar [18] and WOT (Web of Trust) [19] are prominent examples of blacklist-backed browser toolbars.

c) Network Appliances: Dedicated network hardware is another popular option for deploying blacklists. These appliances serve as proxies between user machines within an enterprise network and the rest of the Internet. As users within an organization visit sites, the appliance intercepts outgoing connections and cross-references URLs or IP addresses against a precompiled blacklist. IronPort (acquired by Cisco in 2007) and WebSense are examples of companies that produce blacklist-backed network appliances.
Limitations of blacklists: The primary advantage of blacklists is that querying is a low-overhead operation: the lists of malicious sites are precompiled, so the only computational cost of deployed blacklists is the lookup overhead. However, the need to construct these lists in advance gives rise to their main disadvantage: blacklists become stale. Network administrators block existing malicious sites, and enforcement efforts take down the criminal enterprises behind those sites. There is constant pressure on criminals to construct new sites and to find new hosting infrastructure. As a result, new malicious URLs are introduced and blacklist providers must update their lists yet again. In this process, criminals are always ahead because website construction is inexpensive. Moreover, free services for blogs (e.g., Blogger [20]) and personal hosting (e.g., Google Sites [21], Microsoft Live Spaces [22]) provide another inexpensive source of disposable sites.
4) Page/Popularity Based Property: Popularity features indicate how popular a web page is among Internet users. Various popularity features are as follows:

a) PageRank [10]: It is one of the methods Google uses to determine a page's relevance or importance. The maximum PR of all pages on the web changes every month when Google does its re-indexing. The PageRanks form a probability distribution over web pages, so the sum of all web pages' PageRanks will be equal to unity.
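For reference, the normalization mentioned above corresponds to the standard PageRank recurrence; the formula below is the textbook definition (with damping factor d, usually taken as 0.85) and is not quoted from [10]:

PR(p_i) = \frac{1-d}{N} + d \sum_{p_j \in M(p_i)} \frac{PR(p_j)}{L(p_j)}

where N is the total number of pages, M(p_i) is the set of pages linking to p_i, and L(p_j) is the number of outbound links on p_j. With the (1-d)/N term, the PR values over all pages sum to one, which is the probability-distribution property used here.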
b) Traffic Rank details: Traffic ranks of websites indicate a site's popularity. Alexa.com ranks websites according to Internet traffic over the previous 3 months. Ranks close to 1 are the most reliable; ranks above 100,000 are not so accurate, since the chance for error is high.

5) Lexical feature analysis: Lexical features are the textual properties of the URL itself, not the content of the page it points to. URLs are human-readable text strings that are parsed in a standard way by client programs. Through a multistep resolution process, browsers translate each URL into instructions that locate the server hosting the site and specify where the site or resource is placed on that host. To facilitate this machine translation process, URLs have the following standard syntax:

<protocol>://<hostname><path>

An example of URL resolution is shown below (the original figure labels the protocol, the hostname with its top-level domain, and the path):

https://accounts.google.com/ServiceLogin?service=mail&passive=true&rm=false&continue=https://mail.google.com/mail/&ss=1&scc=1&ltmpl=default&ltmplcache=2

Here the protocol is https, the hostname is accounts.google.com (top-level domain .com), and the path is /ServiceLogin followed by its query parameters.

The <protocol> portion of the URL indicates which network protocol should be used to fetch the requested resource. The most common protocols in use are Hypertext Transport Protocol or HTTP (http), HTTP with Transport Layer Security (https), and File Transfer Protocol (ftp).

The <hostname> is the identifier for the Web server on the Internet. Sometimes it is a machine-readable Internet Protocol (IP) address, but more often, especially from the user's perspective, it is a human-readable domain name.

The <path> of a URL is analogous to the path name of a file on a local computer. The path tokens, delimited by punctuation marks such as slashes, dots, and dashes, show how the site is organized. Criminals sometimes obscure path tokens to avoid scrutiny, or they may deliberately construct tokens to mimic the appearance of a legitimate site.
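To make the protocol/hostname/path decomposition concrete (the paper itself does its URL processing in MATLAB), Python's standard urllib.parse splits the example URL into exactly these components:

from urllib.parse import urlsplit

# Example URL from the text, shortened after the "continue" parameter.
url = ("https://accounts.google.com/ServiceLogin"
       "?service=mail&passive=true&rm=false&continue=https://mail.google.com/mail/")

parts = urlsplit(url)
print(parts.scheme)    # protocol -> https
print(parts.hostname)  # hostname -> accounts.google.com
print(parts.path)      # path     -> /ServiceLogin
print(parts.query)     # the remaining query parameters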
The methodology used in our work to extract the lexical features from the URL list is as follows. The URLs of legitimate websites, collected from alexa.com and dmoz.org, are written into a text file and the file is saved on the computer. Then the MATLAB program is executed. It asks for an input file, and the benign URL list is fed to the program. The program processes the list and the feature list is obtained. The decision vector '0' is added. The list is saved in Excel (xls) and csv format at the location on the computer specified in the program. The same procedure is done for the phishing URL list, with the decision vector '1' added. The feature set comprises host length, path length, number of slashes, number of path tokens, etc. Figure 5 shows the flowchart of feature extraction.

Figure 5. Flow chart for feature extraction
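The extraction step is straightforward to reproduce. The sketch below mirrors the described flow in Python rather than MATLAB; the input and output file names and the restriction to the four features named above are assumptions made for illustration, not the paper's actual script.

import csv
import re
from urllib.parse import urlsplit

def lexical_features(url):
    # Features named in the text: host length, path length,
    # number of slashes and number of path tokens.
    parts = urlsplit(url if "://" in url else "http://" + url)
    host = parts.hostname or ""
    path = parts.path or ""
    tokens = [t for t in re.split(r"[/.\-_]", path) if t]
    return [len(host), len(path), path.count("/"), len(tokens)]

def write_feature_file(url_file, out_csv, label):
    # label is the decision vector value: 0 for benign URLs, 1 for phishing URLs.
    with open(url_file) as src, open(out_csv, "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(["host_len", "path_len", "num_slashes", "num_path_tokens", "label"])
        for line in src:
            url = line.strip()
            if url:
                writer.writerow(lexical_features(url) + [label])

# write_feature_file("benign_urls.txt", "benign_features.csv", 0)
# write_feature_file("phishing_urls.txt", "phishing_features.csv", 1)

Running it once with label 0 on the benign list and once with label 1 on the phishing list yields the two labelled feature files that the classifiers consume.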
1) Naïve Bayes: Its parameters are estimated using maximum likelihood estimation. It takes only one pass over the training set and is computationally very fast.

2) J48 decision tree: A decision tree is a predictive machine-learning model that decides the target value (dependent variable) of a new sample based on various attribute values of the available data.

3) K-NN: It is based on the closest training examples in the feature space. An object is classified by a majority vote of its neighbors.

4) SVM: The SVM performs classification by finding the hyperplane that maximizes the margin between the two classes. The vectors that define the hyperplane are the support vectors.

The program flow for the classifier performance is shown in Figure 6.

Figure 6. Program flow for classifier performance (Start → Generate train.xls, trainresult.xls, test.xls, testresult.xls files → Performance analysis)
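The paper runs these classifiers in WEKA and MATLAB; the following Python/scikit-learn sketch reproduces the same flow under stated assumptions: the feature file name is hypothetical, DecisionTreeClassifier stands in for WEKA's J48 (C4.5), and 60% of the data is held out for testing, matching the reported percentage split.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix

# Assumed input: feature rows as written by the extraction sketch above,
# with the 0/1 decision label in the last column.
data = np.loadtxt("url_features.csv", delimiter=",", skiprows=1)
X, y = data[:, :-1], data[:, -1]

# 60% of the data is held out for testing, as in the paper's percentage split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.60, random_state=0)

models = {
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(),      # stand-in for WEKA's J48 (C4.5)
    "k-NN": KNeighborsClassifier(n_neighbors=5),
    "SVM": SVC(kernel="rbf"),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    print(name, accuracy_score(y_test, predictions))
    print(confusion_matrix(y_test, predictions))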
The PageRank values obtained for the benign and phishing URL sets are plotted in Figure 7 and Figure 8. The PageRank values obtained for phishing sites are: Not Available, Non-Existing and 0.

The N/A PageRank (grey PageRank bar) might be due to one of the following reasons [11]:
• The web page is new, and it is not indexed by Google yet.
• The web page is indexed by Google, but it is not ranked yet.
• The web page was indexed by Google long ago, but it is recognized as a supplemental page.
• The web page or the whole website is banned by Google.

A Supplemental Result is a URL residing in Google's secondary database containing pages of less importance, as measured primarily by Google's PageRank algorithm. Google used to place a "Supplemental Result" label at the bottom of a search result to indicate that it is in the supplemental index; however, in July 2007 they discontinued this practice and it is no longer possible to tell whether a result is in the supplemental index or the main one [11].

PageRank for benign sites ranges from 0 to 9. We used 240 benign URLs and 240 malicious URLs for the plot. It is inferred from the graph that the PageRank is noticeably higher for benign URLs than for phishing websites. One exception is newly registered websites, for which the PageRank check returns an 'N/A' (Not Available) message from the PageRank Checker [11].

[Figure 7 and Figure 8. Number of Web Pages vs. PageRank for the benign and phishing URL sets]
We analyzed the prepared URL feature dataset using the Naïve Bayes, J48 Decision Tree, k-NN, and SVM classifying algorithms in WEKA. The percentage split is set to 60%, i.e., 40 percent of the dataset is taken as training data and 60 percent as test data. The performance is then evaluated based on the confusion matrix, Detection Accuracy, True Positive Rate and False Positive Rate. The results are tabulated in TABLE 1.

The analysis of the dataset is also done using MATLAB, under the same testing conditions, and is tabulated in TABLE 2.

When we compare the Success Rates obtained in WEKA and MATLAB, it is seen that there are slight differences in the values. The J48 Decision Tree has the highest Success Rate compared to the other selected classifying algorithms in WEKA. By using only the lexical features, we were able to achieve a Detection Accuracy/Success Rate of 93.2% for a test split of 60%. When 90% of the dataset is used for testing, we got 93.78% Detection Accuracy. In MATLAB, using the Regression Tree we got 91.08% detection accuracy when using 60% of the dataset for testing and 85.63% detection accuracy when using 90% of the data for testing.

TABLE 1. Classifier Performance – WEKA

Test options          Classifier     Confusion matrix        Success Rate (%)   Error Rate (%)
Percentage split-60   Naïve Bayes    [4438 3578; 260 3945]   68.60              31.40
                      J48            [7612 404; 428 3777]    93.20               6.80
                      SVM            [1846 126; 355 728]     84.26              15.74
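As a check on the tabulated values, the Success Rate can be recomputed directly from a confusion matrix. Taking the J48 row of TABLE 1, and assuming the usual orientation (rows are actual classes, columns are predicted classes, with phishing treated as the positive class; the paper does not state this explicitly):

\text{Accuracy} = \frac{7612 + 3777}{7612 + 404 + 428 + 3777} = \frac{11389}{12221} \approx 0.932, \qquad
\text{TPR} = \frac{3777}{428 + 3777} \approx 0.898, \qquad
\text{FPR} = \frac{404}{7612 + 404} \approx 0.050

The accuracy matches the 93.20% Success Rate reported for J48; the TPR and FPR values depend on the orientation assumption stated above.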
Figure 9 shows a comparison of the TP Rate, FP Rate and Detection Accuracy of the SVM, Naïve Bayes, Regression Tree and k-NN classifiers.
TABLE 2. Classifier Performance – MATLAB

Test options          Classifier        Confusion matrix          Success Rate (%)   Error Rate (%)
Percentage split-60   Naïve Bayes       [7281 303; 3633 4042]     74.20              25.80
                      Regression Tree   [10856 470; 1166 5839]    91.08               8.92
                      KNN               [11299 3025; 723 3284]    79.55              20.45
                      SVM               [9871 806; 1082 3531]     87.65              12.35
Percentage split-90   Naïve Bayes       [13648 1018; 2764 5500]   83.50              16.50
                      Regression Tree   [15082 999; 2951 8465]    85.63              14.37
                      KNN               [16451 5080; 1582 4384]   75.77              24.23
                      SVM               [16416 5848; 5 661]       74.48              25.52

[Figure 9. TP Rate, FP Rate and Detection Accuracy (%) of the classifiers]

…and the classifier makes the decision whether 'Benign' or 'Phish' with its specified accuracy.

VI. CONCLUSION

Several features are compared using various data mining algorithms. The results point to the efficiency that can be achieved using the lexical features. To protect end users from visiting these sites, we can try to identify phishing URLs by analyzing their lexical and host-based features. A particular challenge in this domain is that criminals are constantly devising new strategies to counter our defense measures. To succeed in this contest, we need algorithms that continually adapt to new examples and features of phishing URLs.

Online learning algorithms provide better learning methods compared to batch-based learning mechanisms. Going forward, we are interested in various aspects of online learning and in collecting data to understand new trends in phishing activities, such as fast-changing DNS servers.

REFERENCES

[1] Phishing Trends Report for Q3 2012, Anti-Phishing Working Group, http://antiphishing.org.
[2] Report on Phishing, Binational Working Group on Cross-Border Mass Marketing Fraud, October 2006.
[3] J. Ma, L. K. Saul, S. Savage, and G. M. Voelker, "Beyond Blacklists: Learning to Detect Malicious Web Sites from Suspicious URLs", Proc. of SIGKDD '09.
[4] J. Ma, L. K. Saul, S. Savage, and G. M. Voelker, "Learning to Detect Malicious URLs", ACM Transactions on Intelligent Systems and Technology, Vol. 2, No. 3, Article 30.