Feature Selection for Phishing Website Classification

Vanamala Shivani, A. B. Pravallika, Muthyam Vaishnavi
Department of Electronics and Communication Engineering
Sridevi Women’s Engineering College, Hyderabad, India
Email: vanamalashivani273@gmail.com, pravallika.arra2000@gmail.com, muthyamvaishnavigoud540@gmail.com

Mr. Mallesh Hatti
Department of Electronics and Communication Engineering
Sridevi Women’s Engineering College, Hyderabad, India
Email: santhuhatti90@gmail.com
Abstract— Phishing is an attempt to obtain confidential information about a user or an organization. It is an act of impersonating a credible webpage to lure users into exposing sensitive data, such as usernames, passwords and credit card information. It has cost the online community and various stakeholders hundreds of millions of dollars. There is a need to detect and predict phishing, and the machine learning classification approach is a promising way to do so. However, it may take several phases to identify and tune the effective features from the dataset before the selected classifier can be trained to identify phishing sites correctly. This paper presents the performance of phishing webpage detection using two different machine learning techniques, XGBoost and Logistic Regression, and a deep learning technique, the LSTM algorithm. The classification performance of the two machine learning algorithms is further examined, and the observational results show that the optimized XGBoost achieves the highest performance among all the techniques.

Keywords— Relevant features; phishing; web threat; classification; machine learning; deep learning

I. INTRODUCTION

In today’s world, technology has become an integral part of the twenty-first century. The internet is one of these technologies; it is growing rapidly every year and plays an important role in individuals’ lives. It has become a valuable and convenient mechanism for supporting public transactions such as e-banking and e-commerce. This has led users to trust that it is safe to provide their private information over the Internet. As a result, the thieves who have started to target this information have become a major security problem. Phishing websites are considered to be one of these problems. They rely on a social engineering trick, in which fraudsters try to manipulate users into giving up their personal information by exploiting human vulnerabilities rather than software vulnerabilities. Statistics show that the number of phishing attacks keeps increasing, which presents a security risk to user information according to the Anti-Phishing Working Group (APWG), and the phishing attacks recorded by Kaspersky Lab increased by 47.48% among all the phishing attacks detected during 2016. Recently, there have been several studies that have tried to solve the phishing problem. Some researchers used the URL and compared it with existing blacklists that contain lists of malicious websites which they have been compiling, while others have used the URL in the opposite manner, namely comparing it with a whitelist of legitimate websites. Another approach uses heuristics, which rely on a signature database of known attacks and match the signature of a heuristic pattern to decide whether a site is a phishing website. Additionally, measuring website traffic using Alexa is another way that has been implemented by researchers to detect phishing websites. Moreover, other researchers have used machine learning techniques. Machine learning is a field of computer science, and also a branch of artificial intelligence (AI), that performs tasks and is capable of learning or acting in an intelligent way. It has two different types of learning: supervised learning and unsupervised learning. Supervised learning is based on training a model by giving it a set of measured features of the data associated with a target label related to these data; once the model is trained, it can generate a target label for new, unseen data. On the other hand, unsupervised learning is based on generating new data without being given any target label in the training process.

II. LITERATURE SURVEY

A. Cantina: A Content-based Approach to Detecting Phishing Web Sites

Phishing is a significant problem involving fraudulent email and web sites that trick unsuspecting users into revealing private information. In this paper, we present the design, implementation, and evaluation of CANTINA, a novel, content-based approach to detecting phishing web sites, based on the TF-IDF information retrieval algorithm. We also discuss the design and evaluation of several heuristics we developed to reduce false positives. Our experiments show that CANTINA is good at detecting phishing sites, correctly labelling approximately 95% of phishing sites.

B. Techniques for detecting zero day phishing websites

Phishing is a means of obtaining confidential information through fraudulent web sites that appear to be legitimate. There are many phishing detection techniques available, but current practices leave much to be desired. A central problem is that web browsers rely on a blacklist of known phishing sites, but some phishing sites have a lifespan as short as a few hours. A faster recognition system is needed by the web browser to identify zero day phishing sites, which are new phishing sites that have not yet been discovered. This research improves upon techniques used by popular anti-phishing software and introduces a new method of detecting fraudulent web pages using cascading style sheets (CSS). Current phishing detection techniques are examined and a new detection method is implemented and evaluated against hundreds of known phishing sites.
C. PhishShield: A Desktop Application to Detect Phishing Webpages through Heuristic Approach

Phishing is a website forgery with an intention to track and steal the sensitive information of online users. The attacker fools the user with social engineering techniques such as SMS, voice, email, website and malware. In this paper, we implemented a desktop application called PhishShield, which concentrates on the URL and website content of a phishing page. PhishShield takes a URL as input and outputs the status of the URL as phishing or legitimate. The heuristics used to detect phishing are footer links with null value, zero links in the body of the HTML, copyright content, title content and website identity. PhishShield is able to detect zero-hour phishing attacks which blacklists are unable to detect, and it is faster than the visual-based assessment techniques used in detecting phishing. The accuracy rate obtained for PhishShield is 96.57%, and it covers a wide range of phishing websites, resulting in low false negative and false positive rates.

D. Detecting phishing web sites: A heuristic URL-based approach

With the growth of the Internet, e-commerce plays a vital role in society. As a result, phishing, the act of stealing personal user data used in e-commerce transactions, has become an urgent problem in modern society. Many techniques have been proposed to protect online users, e.g. blacklists and page rank. However, the number of victims has been increasing due to inefficient protection techniques. This is due to the fact that phishers try to make the URLs of phishing sites look similar to the original sites. In this paper, we are interested in proposing a new approach to detect phishing sites by using the features of the URL. In particular, we derive different components from the URL and compute a metric for each component. Then, the page ranking is combined with the obtained metrics to decide whether a website is a phishing website. The proposed phishing detection technique was evaluated with a dataset containing 9,661 phishing websites and 1,000 legitimate websites. The results show that our proposed technique can detect over 97% of phishing websites.

E. A Novel Multi-Layer Heuristic Model for Anti-Phishing

Phishing website detection is very important for e-banking and e-commerce users. Current detection methods for anti-phishing have proved to perform well in terms of accuracy, recall rate and F-measure. However, increasingly complex phishing methods make it more necessary to optimize the detection scheme by recognizing new features in a timely manner and accurately choosing the optimal feature subset. To react to changing phishing methods and resolve feature updating issues, we propose a novel multi-layer heuristic anti-phishing model with feature selection algorithms and heuristic classification algorithms. Five feature selection algorithms are utilized to pre-process feature sets. Then four classification algorithms are applied to identify phishing websites and legitimate websites. Experimental results show that the proposed model, utilizing the information gain algorithm in the feature subset selection procedure and the Random Tree algorithm in heuristic classification, achieves 96% accuracy with less time cost.

F. Web phishing detection using classifier ensemble

This research adapts and develops various methods in the Artificial Intelligence (AI) field to improve web phishing detection. Based on the features from the Carnegie Mellon Anti-phishing and Network Analysis Tool (CANTINA), we add, modify or remove features when they are used to train a machine learning method. We also add our own developed features, called homepage similarity features, to the model. Moreover, we applied the classifier ensemble concept to the study. After training with 500 phishing web pages and 500 non-phishing web pages, the experiments on 1,500 pages per class showed that our proposed methodology could boost accuracy by up to approximately 30% over the traditional heuristic method's results.

III. METHODOLOGY

The framework in Figure 1 represents the module description of the analysis.

Figure 1. Block diagram.

A. Dataset

The URLs of benign websites were collected from www.alexa.com and the URLs of phishing websites were collected from www.phishtank.com. The dataset consists of a total of 25,469 URLs, which include 12,058 benign URLs and 13,411 phishing URLs. Benign URLs are labelled as “B” and phishing URLs are labelled as “M”.
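As a concrete illustration, a labelled URL dataset of this kind could be loaded as sketched below. The file name urls.csv and its column names (url, label) are assumptions for illustration; the paper does not specify its data files.

    # Minimal sketch: loading a labelled URL dataset of the kind described above.
    # "urls.csv" and its columns ("url", "label") are assumed names, not the
    # authors' actual files.
    import pandas as pd

    data = pd.read_csv("urls.csv")                        # one row per URL
    data["target"] = (data["label"] == "M").astype(int)   # "M" (phishing) -> 1, "B" (benign) -> 0

    print(data["target"].value_counts())                  # roughly 13,411 phishing vs 12,058 benign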
B. Data Preprocessing

Data preprocessing consists of cleansing, instance selection, feature extraction, normalization, transformation, etc. The result of data preprocessing is the final training dataset. Data preprocessing may affect how the results of the subsequent processing are interpreted. Data cleaning is the step in which missing data are filled in, noise is smoothed, outliers are recognized or removed, and inconsistencies are resolved. Data integration is the step in which additional databases or data sets are combined. Data transformation is where aggregation and normalization are performed to scale the data. By doing data reduction we can obtain a reduced view of the dataset that is much smaller in size but still produces practically the same analytical outcome.
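The sketch below illustrates how such cleaning and normalization steps might look in a pandas/scikit-learn workflow; the feature column names and the choice of min-max scaling are our assumptions, not the authors' exact procedure.

    # Illustrative preprocessing sketch: cleaning followed by normalization of
    # numeric lexical features. The feature names are hypothetical placeholders.
    import pandas as pd
    from sklearn.preprocessing import MinMaxScaler

    def preprocess(frame: pd.DataFrame) -> pd.DataFrame:
        frame = frame.drop_duplicates().dropna().copy()   # cleaning: duplicates and missing values
        numeric_cols = ["url_length", "num_dots", "num_hyphens"]   # hypothetical lexical features
        frame[numeric_cols] = MinMaxScaler().fit_transform(frame[numeric_cols])  # rescale to [0, 1]
        return frame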
C. Exploratory Data Analysis

Exploratory Data Analysis (EDA) is a data analysis technique that provides more than one method, primarily diagrammatic, as shown in Figure 2. It maximizes insight into a data set, unveils its hidden structure, extracts essential parameters, locates outliers and anomalies, and tests hidden assumptions.

Figure 2. Heatmap
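Figure 2 is a correlation heatmap of the extracted features. A sketch of how such a plot could be produced with seaborn is given below; the exact feature set behind the paper's figure is not listed, so this is illustrative only.

    # Sketch: correlation heatmap of the numeric features, in the spirit of Figure 2.
    import matplotlib.pyplot as plt
    import seaborn as sns

    def plot_feature_heatmap(frame):
        corr = frame.select_dtypes("number").corr()   # pairwise feature correlations
        sns.heatmap(corr, cmap="coolwarm")            # darker cells = stronger correlation
        plt.title("Feature correlation heatmap")
        plt.tight_layout()
        plt.show()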
D. Train-test split

The dataset is split into two subsets, a training set and a testing set, so that the algorithms can be fitted on the training dataset and then used for detecting phishing websites on the testing dataset. 30% of the data is reserved for the testing set so that the training model can train on and learn the data effectively.
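A sketch of this 70/30 split with scikit-learn follows; stratifying on the label and fixing the random seed are our choices, not stated in the paper.

    # Sketch: hold out 30% of the data for testing, as described above.
    import pandas as pd
    from sklearn.model_selection import train_test_split

    data = pd.read_csv("urls.csv")              # hypothetical file from the loading sketch
    y = (data["label"] == "M").astype(int)      # 1 = phishing, 0 = benign
    X = data.drop(columns=["url", "label"])     # remaining columns are the extracted features

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.30, random_state=42, stratify=y)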
IV. LEXICAL STRUCTURE OF A URL

The lexical structure of a URL, as shown in Fig. 3, can reveal hidden information about the URL. A URL starts with a protocol name such as HTTP or HTTPS. The FQDN (fully qualified domain name) is the complete domain name of the server hosting the web site, which is later translated into an IP address using DNS servers. The domain name consists of a second-level domain (SLD) suffixed with the top-level domain (TLD) to which it is registered. The domain name is a registered name that is registered with a domain registrar and is unique across the Internet.

Figure 3. Lexical structure of a URL

The domain name portion is a name registered with a domain name registrar. The hostname consists of a subdomain name and a domain name. A phisher can easily modify the subdomain name portion and can associate it with any value. The URL may also contain a path and file, which can also be easily modified by the phisher if he wishes. The subdomain name and path of a URL can thus be controlled by the phisher. An attacker can register a domain only once, and once this domain is identified as fraudulent, it is easy to prevent users from visiting it. The problem lies with the variable parts of the URL, i.e. the subdomain and path. This is the reason cybersecurity experts, designers and users struggle to provide a feasible solution to mitigate URL phishing attacks.

Let us consider the phishing URL given below:

“http://amazon.com-verification-accounts.darotob.com/Sign-in/5b60fcc60b36d1c3d”

The lexical analysis of the above URL reveals the parts shown in Fig. 4:

    Protocol      http://
    Sub Domain    amazon
    Sub Domain    com-verification-accounts
    Domain Name   darotob.com
    Path          /Sign-in/5b60fcc60b36d1c3d

Figure 4. Different parts of the URL

The attackers obfuscate the URL in such a way that the actual domain name might not be easily revealed to the normal user and is nested deep inside the URL; e.g. in the above URL the actual domain name is “darotob.com”, but at first glance it might look like “amazon.com”. For a normal user not aware of the intricacies of web technologies, “amazon.com” at the beginning of the URL can provide assurance and trust about the website, and he might be tempted to connect to it and share his confidential and sensitive information with the fraudulent website. This is a very common fraudulent technique used by attackers. Cybercriminals can employ a technique called cybersquatting [18] (another name is domain squatting), where the attacker registers a domain name with the bad intention of making a profit from the goodwill of a brand name or trademark belonging to other companies or organizations. For example, if the name of a renowned brand is “greatcompany.com”, the phisher can register “greatcompany.net”, “greatcompany.org”, “greatcompany.biz”, etc. for fraud. Cybercriminals can also adopt typosquatting [18] (also known as URL hijacking), which is a variant form of cybersquatting that relies on typographical mistakes unknowingly made by users while typing a website address into a web browser; such typographical errors are hard to notice during quick casual reading. The URLs created by typosquatting look very similar to well-known trusted domains. A user might occasionally type an incorrect web address or click a link that looks very similar to a trusted domain, and this might lead him to visit a phishing website owned by a phisher. A very famous example of typosquatting is “goggle.com”, which is a phishing website and extremely dangerous. Another example of typosquatting is “yutube.com”, which is a typosquatting equivalent of “youtube.com”.
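A decomposition like the one in Fig. 4 can be reproduced programmatically. The sketch below uses Python's urllib.parse together with the third-party tldextract package; this tooling is our choice for illustration, not something the paper prescribes.

    # Sketch: lexically decomposing a URL into the parts shown in Fig. 4.
    # tldextract is a third-party package (pip install tldextract).
    from urllib.parse import urlparse
    import tldextract

    url = "http://amazon.com-verification-accounts.darotob.com/Sign-in/5b60fcc60b36d1c3d"

    parsed = urlparse(url)
    parts = tldextract.extract(url)

    print("Protocol   :", parsed.scheme)                      # http
    print("Sub Domain :", parts.subdomain)                    # amazon.com-verification-accounts
    print("Domain Name:", parts.domain + "." + parts.suffix)  # darotob.com
    print("Path       :", parsed.path)                        # /Sign-in/5b60fcc60b36d1c3d

Lexical features such as URL length, number of dots, or the presence of extra subdomains can then be counted directly from these parts.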
V. MACHINE LEARNING APPROACH

Machine learning provides simplified and efficient methods for data analysis. It has recently shown promising outcomes in real-time classification problems.
The key advantage of machine learning is the ability to create flexible models for specific tasks like phishing detection. Since phishing detection is a classification problem, machine learning models can be used as a powerful tool. Machine learning models can adapt quickly to changes in the patterns of fraudulent transactions, which helps to develop a learning-based identification system. Most of the machine learning models discussed here are classified as supervised machine learning, where an algorithm tries to learn a function that maps an input to an output based on example input-output pairs. It infers a function from labeled training data consisting of a set of training examples. We present the machine learning methods that we used in our study below.

A. Logistic Regression

Logistic Regression is a classification algorithm used to assign observations to a discrete set of classes. Unlike linear regression, which outputs continuous numeric values, Logistic Regression transforms its output using the logistic sigmoid function to return a probability value, which can then be mapped to two or more discrete classes. Logistic regression works well when the relationship in the data is almost linear, but it performs poorly when there are complex nonlinear relationships between the variables. Besides, it requires more statistical assumptions than other techniques.
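A minimal scikit-learn sketch of such a logistic regression classifier is shown below; the hyperparameters are library defaults rather than the authors' tuned values, and X_train, X_test, y_train, y_test refer to the split sketched in Section III.

    # Sketch: logistic regression on the train/test split from Section III.
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    log_reg = LogisticRegression(max_iter=1000)   # sigmoid output mapped to the two classes
    log_reg.fit(X_train, y_train)

    print("Test accuracy:", accuracy_score(y_test, log_reg.predict(X_test)))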
B. Gradient Boosting

Gradient Boosting trains many models incrementally and sequentially. The main difference between AdaBoost and the Gradient Boosting algorithm is how the algorithms identify the shortcomings of weak learners such as decision trees. While the AdaBoost model identifies the shortcomings by using highly weighted data points, Gradient Boosting performs the same task by using gradients of the loss function. The loss function is a measure indicating how good the model's coefficients are at fitting the underlying data. A logical understanding of the loss function depends on what we are trying to optimize. [20]
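The sketch below shows a gradient boosting classifier fitted with scikit-learn; the number of trees, learning rate and tree depth are illustrative values, not the paper's settings.

    # Sketch: gradient boosting, where each new tree is fitted to the gradient of
    # the loss of the current ensemble. Hyperparameters are illustrative.
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.metrics import accuracy_score

    gb = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1, max_depth=3)
    gb.fit(X_train, y_train)        # X_train/y_train from the split in Section III

    print("Test accuracy:", accuracy_score(y_test, gb.predict(X_test)))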
C. XGBoost

XGBoost is a refined and customized version of Gradient Boosting that provides better performance and speed. The most important factor behind the success of XGBoost is its scalability in all scenarios. XGBoost runs more than ten times faster than popular solutions on a single machine and scales to billions of examples in distributed or memory-limited settings. The scalability of XGBoost is due to several important algorithmic optimizations. These innovations include a novel tree learning algorithm for handling sparse data and a theoretically justified weighted quantile sketch procedure that enables handling instance weights in approximate tree learning. Parallel and distributed computing make learning faster, which enables quicker model exploration. More importantly, XGBoost exploits out-of-core computation and enables data scientists to process hundreds of millions of examples on a desktop. Finally, it is even more exciting to combine these techniques to make an end-to-end system that scales to even larger data with the least amount of cluster resources. [21]
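An XGBoost classifier for the same task can be sketched as follows; the xgboost package is installed separately (pip install xgboost), and the parameter values are illustrative rather than the tuned configuration reported in the results.

    # Sketch: XGBoost classifier on the same features; parameters are illustrative.
    from xgboost import XGBClassifier
    from sklearn.metrics import accuracy_score

    xgb = XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.1,
                        eval_metric="logloss")
    xgb.fit(X_train, y_train)       # X_train/y_train from the split in Section III

    print("Test accuracy:", accuracy_score(y_test, xgb.predict(X_test)))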
VI. MODELING PHISHING URLS WITH RECURRENT NEURAL NETWORKS

A neural network is a bio-inspired machine learning model that consists of a set of artificial neurons with connections between them. Recurrent Neural Networks (RNN) are a type of neural network that is able to model sequential patterns. The distinctive characteristic of RNNs is that they introduce the notion of time to the model, which in turn allows them to process sequential data one element at a time and learn their sequential dependencies [10].

Figure 5. Recurrent neural network for classifying phishing URLs based on LSTM units.

One limitation of general RNNs is that they are unable to learn the correlation between elements more than 5 or 10 time steps apart [29]. A model that overcomes this problem is Long Short-Term Memory (LSTM). This model can bridge elements separated by more than 1,000 time steps without loss of short time lag capabilities [30].

LSTM is an adaptation of the RNN. Here, each neuron is replaced by a memory cell that, in addition to a conventional neuron representing an internal state, uses multiplicative units as gates to control the flow of information. A typical LSTM cell has an input gate that controls the input of information from the outside, a forget gate that controls whether to keep or forget the information in the internal state, and an output gate that allows or prevents the internal state from being seen from the outside.

In this work, we used LSTM units to build a model that receives a URL as a character sequence and predicts whether or not the URL corresponds to a case of phishing. The architecture is illustrated in Fig. 5. Each input character is translated by a 128-dimension embedding. The translated URL is fed into an LSTM layer as a 150-step sequence. Finally, the classification is performed using an output sigmoid neuron. The network is trained by backpropagation using a cross-entropy loss function and dropout in the last layer.
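The architecture described above can be sketched in Keras as follows. The embedding size (128), sequence length (150), sigmoid output, cross-entropy loss and dropout follow the description in the text; the character vocabulary size, LSTM width and dropout rate are assumed values, since the paper does not list them.

    # Sketch of the LSTM URL classifier described above. VOCAB_SIZE, the LSTM
    # width (128) and the dropout rate (0.5) are assumptions.
    from tensorflow.keras import layers, models

    MAX_LEN = 150      # URLs padded/truncated to 150 characters
    VOCAB_SIZE = 100   # assumed number of distinct characters (printable ASCII)

    model = models.Sequential([
        layers.Input(shape=(MAX_LEN,)),
        layers.Embedding(input_dim=VOCAB_SIZE, output_dim=128),  # 128-dimension character embedding
        layers.LSTM(128),                                        # reads the URL one character at a time
        layers.Dropout(0.5),                                     # dropout in the last layer
        layers.Dense(1, activation="sigmoid"),                   # probability that the URL is phishing
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    model.summary()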
RESULTS

The phishing website detection model has been trained and tested using several classifiers and ensemble algorithms to analyze and compare the models' results for the best accuracy. Each algorithm reports its evaluated accuracy, and once all the algorithms have returned their results, each is compared with the others to see which provides the highest accuracy percentage, as shown in Table 1. Each algorithm's accuracy is also depicted in a confusion matrix for greater comprehension. The dataset is additionally trained using a deep learning algorithm. The final accuracy comparison of the algorithms is shown in Figure 6.

Table 1. Comparison of classifiers

    Classifier            Training set accuracy   Testing set accuracy   Precision
    Logistic Regression   92.00                   92.00                  89.00
    XGBoost               93.80                   93.40                  93.42

Figure 6. Comparison of ML algorithms
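A short sketch of how the test-set accuracy, precision and confusion matrix behind Table 1 and Figure 6 can be computed is given below; log_reg and xgb refer to the fitted models from the earlier sketches.

    # Sketch: comparing the trained classifiers on the held-out test set, in the
    # spirit of Table 1 and Figure 6. The fitted models come from earlier sketches.
    from sklearn.metrics import accuracy_score, confusion_matrix, precision_score

    for name, model in [("Logistic Regression", log_reg), ("XGBoost", xgb)]:
        pred = model.predict(X_test)
        print(name,
              "accuracy:", round(accuracy_score(y_test, pred), 4),
              "precision:", round(precision_score(y_test, pred), 4))
        print(confusion_matrix(y_test, pred))   # rows: true class, columns: predicted class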
CONCLUSION

This paper aims to enhance the detection of phishing websites using machine learning technology. We achieved 97.14% detection accuracy using the random forest algorithm with the lowest false positive rate. The results also show that the classifiers perform better when more data is used for training. In the future, a hybrid technique will be implemented to detect phishing websites more accurately, combining the random forest machine learning algorithm with the blacklist method.

ACKNOWLEDGMENT

Special thanks to our supervisor, Professor Mr. Mallesh Hatti, for his guidance.

REFERENCES

[1] AO Kaspersky Lab, "The Dangers of Phishing: Help employees avoid the lure of cybercrime," 2017. [Online]. Available: https://go.kaspersky.com/Dangers-PhishingLanding-Page-Soc.html [Oct. 30, 2017].
[2] "Financial threats in 2016: Every Second Phishing Attack Aims to Steal Your Money," Feb. 22, 2017. [Oct. 30, 2017].
[3] Y. Zhang, J. I. Hong, and L. F. Cranor, "Cantina: A Content-based Approach to Detecting Phishing Web Sites," New York, NY, USA, 2007, pp. 639-648.
[4] M. Blasi, "Techniques for detecting zero day phishing websites," M.A. thesis, Iowa State University, USA, 2009.
[5] R. S. Rao and S. T. Ali, "PhishShield: A Desktop Application to Detect Phishing Webpages through Heuristic Approach," Procedia Computer Science, vol. 54, pp. 147-156, 2015.
[6] M. Jakobsson and S. Myers, Phishing and Countermeasures: Understanding the Increasing Problem of Electronic Identity Theft. Wiley, 2006, pp. 2-3.
[7] L. A. T. Nguyen, B. L. To, H. K. Nguyen, and M. H. Nguyen, "Detecting phishing web sites: A heuristic URL-based approach," in 2013 International Conference on Advanced Technologies for Communications (ATC 2013), 2013, pp. 597-602.
[8] Z. Zhang, Q. He, and B. Wang, "A Novel Multi-Layer Heuristic Model for Anti-Phishing," New York, NY, USA, 2017, pp. 21:1-21:6.
[9] N. Sanglerdsinlapachai and A. Rungsawang, "Web Phishing Detection Using Classifier Ensemble," New York, NY, USA, 2010, pp. 210-215.
[10] G. Xiang, J. Hong, C. P. Rose, and L. Cranor, "CANTINA+: A Feature-Rich Machine Learning Framework for Detecting Phishing Web Sites," ACM Trans. Inf. Syst. Secur., vol. 14, no. 2, pp. 21:1-21:28, Sep. 2011.
[11] R. M. Mohammad, F. Thabtah, and L. McCluskey, "Predicting phishing websites based on self-structuring neural network," Neural Computing and Applications, vol. 25, no. 2, pp. 443-458, Aug. 2014.
[12] K. V. Pradeepthi and A. Kannan, "Performance study of classification techniques for phishing URL detection," in 2014 Sixth International Conference on Advanced Computing (ICoAC), 2014, pp. 135-139.
[13] S. Marchal, J. François, R. State, and T. Engel, "PhishStorm: Detecting Phishing With Streaming Analytics," IEEE Transactions on Network and Service Management, vol. 11, no. 4, pp. 458-471, Dec. 2014.
[14] A. Sirageldin, B. B. Baharudin, and L. T. Jung, "Malicious Web Page Detection: A Machine Learning Approach," in Advances in Computer Science and its Applications, Springer, Berlin, Heidelberg, 2014, pp. 217-224.
[15] R. Verma and K. Dyer, "On the Character of Phishing URLs: Accurate and Robust Statistical Learning Classifiers," New York, NY, USA, 2015, pp. 111-122.
[16] H. H. Nguyen and D. T. Nguyen, "Machine Learning Based Phishing Web Sites Detection," in AETA 2015: Recent Advances in Electrical Engineering and Related Sciences, V. H. Duy, T. T. Dao, I. Zelinka, H.-S. Choi, and M. Chadli, Eds. Cham: Springer International Publishing, 2016, pp. 123-131.
