Muthyam Vaishnavi
Department of Electronics and Communication Engineering
Sridevi Women’s Engineering College, Hyderabad, India
Email id: pravallika.arra2000@gmail.com, vanamalashivani273@gmail.com, muthyamvaishnavigoud540@gmail.com

Department of Electronics and Communication Engineering
Sridevi Women’s Engineering College, Hyderabad, India
Email id: santhuhatti90@gmail.com

Abstract: Phishing is an attempt to obtain confidential information about a user or an organization. It is an act of impersonating a credible webpage to lure users into exposing sensitive data, such as usernames, passwords and credit card information. It has cost the online community and various stakeholders hundreds of millions of dollars. There is a need to detect and predict phishing, and machine learning classification is a promising approach to doing so. However, it may take several phases to identify and tune the effective features from the dataset before the selected classifier can be trained to identify phishing sites correctly. This paper presents the performance of phishing webpage detection using two machine learning techniques, XGBoost and Logistic Regression, and one deep learning technique, the LSTM algorithm. The classification performance of the more effective of the two machine learning algorithms is then tuned further. The experimental results show that the optimized XGBoost achieves the highest performance among all the techniques.

Keywords: Relevant features; phishing; web threat; classification; machine learning; deep learning

I. INTRODUCTION

Technology has become an integral part of life in the twenty-first century. The Internet is one of these technologies; it grows rapidly every year and plays an important role in individuals’ lives. It has become a valuable and convenient mechanism for supporting public transactions such as e-banking and e-commerce. This has led users to trust the Internet with their private information. As a result, thieves who target this information have become a major security problem. Phishing websites are one of these problems. They rely on social engineering: fraudsters try to manipulate users into giving up their personal information by exploiting human vulnerabilities rather than software vulnerabilities. Statistics show that the number of phishing attacks keeps increasing, which presents a security risk to user information, according to the AntiPhishing Working Group (APWG) and to the phishing attacks recorded by Kaspersky Lab, which stated that it has increased by 47.48% from all of the phishing attacks that were detected during 2016.

Recently, several studies have tried to solve the phishing problem. Some researchers took the URL and compared it with existing blacklists of malicious websites, which they had been building, while others used the URL in the opposite manner, comparing it with a whitelist of legitimate websites. Another approach uses heuristics: a signature database of known attacks is matched against the heuristic pattern of a site to decide whether it is a phishing website. Additionally, measuring website traffic using Alexa is another way that researchers have implemented to detect phishing websites. Moreover, other researchers have used machine learning techniques. Machine learning is a field of computer science, and a branch of artificial intelligence (AI), in which a system performs tasks and is capable of learning or acting in an intelligent way. It has two different types of learning: supervised learning and unsupervised learning. Supervised learning trains a model on a set of measured features of the data together with a target label for each sample; once the model is trained, it can predict the target label of previously unseen data. Unsupervised learning, on the other hand, looks for structure in the data without being given any target label in the training process.

II. LITERATURE SURVEY

A. Cantina: A Content-based Approach to Detecting Phishing Web Sites

Phishing is a significant problem involving fraudulent email and web sites that trick unsuspecting users into revealing private information. In this paper, we present the design, implementation, and evaluation of CANTINA, a novel, content-based approach to detecting phishing web sites, based on the TF-IDF information retrieval algorithm. We also discuss the design and evaluation of several heuristics we developed to reduce false positives. Our experiments show that CANTINA is good at detecting phishing sites, correctly labelling approximately 95% of phishing sites.

B. Techniques for detecting zero day phishing websites

Phishing is a means of obtaining confidential information through fraudulent web sites that appear to be legitimate. There are many phishing detection techniques available, but current practices leave much to be desired. A central problem is that web browsers rely on a blacklist of known phishing sites, but some phishing sites have a lifespan as short as a few hours. A faster recognition system is needed by the web browser to identify zero day phishing sites, which are new phishing sites that have not yet been discovered.

This research improves upon techniques used by popular anti-phishing software and introduces a new method of detecting fraudulent web pages using cascading style sheets (CSS). Current phishing detection techniques are examined and a new detection method is implemented and evaluated against hundreds of known phishing sites.
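To make the TF-IDF weighting mentioned in subsection A more concrete, the following is a minimal sketch of how the terms of a page can be ranked by TF-IDF. It is not CANTINA's implementation: the function names (`tokenize`, `tf_idf_scores`, `top_terms`), the tokenizer, the reference corpus and the unsmoothed logarithmic IDF are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tf_idf_scores(page, corpus):
    """Score each term of `page` by term frequency times inverse document frequency.

    `corpus` is a list of reference documents (an assumption of this sketch).
    The page itself is counted as a document, so every term has df >= 1.
    """
    documents = [set(tokenize(doc)) for doc in corpus] + [set(tokenize(page))]
    tokens = tokenize(page)
    counts = Counter(tokens)
    scores = {}
    for term, count in counts.items():
        tf = count / len(tokens)                         # relative frequency in the page
        df = sum(1 for doc in documents if term in doc)  # documents containing the term
        idf = math.log(len(documents) / df)              # rarer terms weigh more
        scores[term] = tf * idf
    return scores


def top_terms(page, corpus, k=5):
    """Return the k highest-scoring terms, ties broken alphabetically."""
    scores = tf_idf_scores(page, corpus)
    return sorted(scores, key=lambda term: (-scores[term], term))[:k]


corpus = [
    "welcome to the site please sign in",
    "the news of the day",
    "sign in to the forum",
]
page = "paypal account login the paypal secure login"
print(top_terms(page, corpus, k=2))
```

Terms that are frequent on the page but rare elsewhere (here the brand and login words) rise to the top, while a word that occurs in every document, such as "the", receives a score of zero.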
C. PhishShield: A Desktop Application to Detect Phishing Webpages through Heuristic Approach

Phishing is a website forgery with an intention to track and steal the sensitive information of online users. The attacker fools the user with social engineering techniques such as SMS, voice, email, website and malware.

In this paper, we implemented a desktop application called PhishShield, which concentrates on the URL and website content of a phishing page. PhishShield takes a URL as input and outputs the status of the URL as a phishing or legitimate website. The heuristics used to detect phishing are footer links with null value, zero links in the body of the HTML, copyright content, title content and website identity. PhishShield is able to detect zero-hour phishing attacks, which blacklists are unable to detect, and it is faster than the visual-based assessment techniques that are used in detecting phishing. The accuracy rate obtained for PhishShield is 96.57%, and it covers a wide range of phishing web sites, resulting in lower false negative and false positive rates.

D. Detecting phishing web sites: A heuristic URL-based approach

With the growth of the Internet, e-commerce plays a vital role in society. As a result, phishing, the act of stealing personal user data used in e-commerce transactions, has become an urgent problem in modern society. Many techniques have been proposed to protect online users, e.g. blacklists and PageRank. However, the number of victims has been increasing due to inefficient protection techniques. This is because phishers try to make the URLs of phishing sites look similar to those of the original sites. In this paper, we propose a new approach to detecting phishing sites by using the features of the URL. In particular, we derive different components from the URL and compute a metric for each component. Then, the page ranking is combined with the computed metrics to decide whether the websites are phishing websites. The proposed phishing detection technique was evaluated on a dataset that contains 9,661 phishing websites and 1,000 legitimate websites. The results show that our proposed technique can detect over 97% of phishing websites.

E. A Novel Multi-Layer Heuristic Model for Anti-Phishing

Phishing website detection is very important for e-banking and e-commerce users. Current anti-phishing detection methods have proved to perform well in terms of accuracy, recall rate and F-measure. However, increasingly complex phishing methods make it more necessary to optimize the detection scheme by recognizing new features in a timely manner and accurately choosing the optimal feature subset. To react to changing phishing means and resolve feature updating issues, we propose a novel multi-layer heuristic anti-phishing model with feature selection algorithms and heuristic classification algorithms. Five feature selection algorithms are utilized to pre-process feature sets. Then four classification algorithms are applied to identify phishing websites and legitimate websites. Experimental results show that the proposed model, utilizing the information gain algorithm for feature subset selection and the Random Tree algorithm for heuristic classification, achieves 96% accuracy with less time cost.

F. Web phishing detection using classifier ensemble

This research adapts and develops various methods in the Artificial Intelligence (A.I.) field to improve web phishing detection. Based on the features from the Carnegie Mellon Antiphishing and Network Analysis Tool (CANTINA), we add, modify or reduce features used to train a machine learning method. We also add our own features, called homepage similarity features, to the machine. Moreover, we applied the classifier ensemble concept to the study. After training with 500 phishing web pages and 500 non-phishing web pages, the experiments on 1,500 pages per class showed that our proposed methodology could boost accuracy by up to approximately 30% over the results of the traditional heuristic method.

III. METHODOLOGY

The framework in Figure 1 represents the module description of the analysis.

Figure 1. Block diagram.

A. Dataset

URLs of benign websites were collected from www.alexa.com and URLs of phishing websites were collected from www.phishtank.com. The data set consists of a total of 25,469 URLs, which include 12,058 benign URLs and 13,411 phishing URLs. Benign URLs are labelled as “B” and phishing URLs are labelled as “M”.

B. Data Preprocessing

Data preprocessing consists of cleansing, instance selection, feature extraction, normalization, transformation, etc. The result of data preprocessing is the final training dataset. Data preprocessing may affect how the results of the final processing are interpreted. Data cleaning is the step in which missing data are filled in, noise is smoothed, outliers are recognized or removed, and inconsistencies are resolved. Data integration is the step in which several databases or data sets are combined. Data transformation is the step in which aggregation and normalization are performed on the data. Data reduction yields a representation of the dataset that is much smaller in size but still produces the same outcome of the analysis.

C. Exploratory Data Analysis

Exploratory Data Analysis (EDA) is an approach to data analysis that offers a range of methods, most of them graphical, as shown in Figure 2. It maximizes insight into a data set, unveils its hidden structure, extracts essential parameters, locates outliers and anomalies, and tests underlying assumptions.

Figure 2. Heatmap.

D. Train-test split

The dataset is split into two subsets, a training set and a testing set, so that the algorithms can be fitted on the training set and then used to detect phishing websites in the testing set. 30% of the data is reserved for the testing set, leaving the remaining 70% for the model to train on and learn the data effectively.

IV. LEXICAL STRUCTURE OF A URL

The lexical structure of a URL, as shown in Fig. 3, can reveal hidden information about the URL. A URL starts with a protocol name such as HTTP or HTTPS. The FQDN (fully qualified domain name) is the complete domain name of the server hosting the web site, which is later translated into an IP address by DNS servers. The domain name consists of a second-level domain (SLD) suffixed with the top-level domain (TLD) under which it is registered. The domain name is registered with a domain registrar and is unique across the Internet.

Figure 3. Lexical structure of a URL.

The hostname consists of a subdomain name and the domain name. A phisher can easily modify the subdomain portion and associate it with any value. The URL may also contain a path and a file, which can also be easily modified by the phisher. In other words, the subdomain name and the path of a URL are under the phisher's control. An attacker can register a given domain only once, and once this domain is identified as fraudulent it is easy to prevent users from visiting it. The problem lies with the variable parts of the URL, i.e. the subdomain and the path. This is the reason cybersecurity experts, designers and users struggle to provide a feasible solution to mitigate URL phishing attacks.

Consider the phishing URL given below:

“http://amazon.com-verification-accounts.darotob.com/Sign-in/5b60fcc60b36d1c3d”

The lexical analysis of this URL reveals the parts shown in Fig. 4. Attackers obfuscate the URL in such a way that the actual domain name is nested deep inside the URL and is not easily noticed by a normal user. In the URL above, the actual domain name is “darotob.com”, but at first glance it might look like “amazon.com”.

Protocol: http://
Sub Domain: amazon
Sub Domain: com-verification-accounts
Domain Name: darotob.com
Path: /Sign-in/5b60fcc60b36d1c3d

Figure 4. Different parts of the URL.

For a normal user who is not aware of the intricacies of web technologies, “amazon.com” at the beginning of the URL can provide assurance and trust in the website, and the user might be tempted to connect to it and share confidential and sensitive information with the fraudulent website. This is a very common fraudulent technique used by attackers. Cybercriminals can also employ cybersquatting [18] (also called domain squatting), in which the attacker registers a domain name with the bad intention of profiting from the goodwill of a brand name or trademark belonging to another company or organization. For example, if the name of a renowned brand is “greatcompany.com”, the phisher can register the domains “greatcompany.net”, “greatcompany.org”, “greatcompany.biz”, etc. for fraud. Cybercriminals can also adopt typosquatting [18] (also known as URL hijacking), a form of cybersquatting that relies on typographical mistakes made unknowingly by users while typing a website address into a web browser; such typographical errors are hard to notice during quick, casual reading. The URLs created by typosquatting look very similar to well-known trusted domains. The user might occasionally type an incorrect web address, or click a link that looks very similar to a trusted domain, and this might lead to a phishing website owned by a phisher. A very famous example of typosquatting is “goggle.com”, which is a phishing website and extremely dangerous. Another example is “yutube.com”, the typosquatting equivalent of “youtube.com”.

V. MACHINE LEARNING APPROACH

Machine learning provides simplified and efficient methods for data analysis. It has recently shown promising outcomes in real-time classification problems. The key advantage of machine learning is the ability to create flexible models for specific tasks like phishing detection. Since phishing detection is a classification problem, machine learning models can be used as a powerful tool. Machine learning models can adapt quickly to changes and identify patterns of fraudulent transactions, which helps to develop a learning-based identification system. Most of the machine learning models discussed here are supervised: the algorithm tries to learn a function that maps an input to an output based on example input-output pairs. It infers this function from labeled training data consisting of a set of training examples. We present the machine learning methods that we used in our study.

A. Logistic Regression

Logistic Regression is a classification algorithm used to assign observations to a discrete set of classes. Unlike linear regression, which outputs continuous values, Logistic Regression transforms its output using the logistic sigmoid function to return a probability value, which can then be mapped to two or more discrete classes. Logistic regression works well when the relationship in the data is almost linear; if there are complex nonlinear relationships between variables, it performs poorly. Besides, it requires more statistical assumptions than other techniques.

B. Gradient Boosting

Gradient Boosting trains many models incrementally and sequentially. The main difference between AdaBoost and the Gradient Boosting algorithm is how the algorithms identify the shortcomings of weak learners such as decision trees. While the AdaBoost model identifies the shortcomings by using highly weighted data points, Gradient Boosting does the same by using gradients of the loss function. The loss function is a measure indicating how good the model's coefficients are at fitting the underlying data. A logical understanding of the loss function depends on what we are trying to optimize. [20]

C. XGBoost

XGBoost is a refined and customized version of Gradient Boosting designed to provide better performance and speed. The most important factor behind the success of XGBoost is its scalability in all scenarios. XGBoost runs more than ten times faster than popular solutions on a single machine and scales to billions of examples in distributed or memory-limited settings. The scalability of XGBoost is due to several important algorithmic optimizations. These innovations include a novel tree learning algorithm for handling sparse data and a theoretically justified weighted quantile sketch procedure that enables handling instance weights in approximate tree learning. Parallel and distributed computing make learning faster, which enables quicker model exploration. More importantly, XGBoost exploits out-of-core computation and enables data scientists to process hundreds of millions of examples on a desktop. Finally, it is even more exciting to combine these techniques into an end-to-end system that scales to even larger data with the least amount of cluster resources. [21]

VI. MODELING PHISHING URLS WITH RECURRENT NEURAL NETWORKS

A neural network is a bio-inspired machine learning model that consists of a set of artificial neurons with connections between them. Recurrent Neural Networks (RNN) are a type of neural network that is able to model sequential patterns. The distinctive characteristic of RNNs is that they introduce the notion of time to the model, which in turn allows them to process sequential data one element at a time and learn its sequential dependencies [10].

One limitation of general RNNs is that they are unable to learn the correlation between elements more than 5 or 10 time steps apart [29]. A model that overcomes this problem is Long Short Term Memory (LSTM). This model can bridge elements separated by more than 1,000 time steps without loss of short time lag capabilities [30].

LSTM is an adaptation of the RNN. Here, each neuron is replaced by a memory cell that, in addition to a conventional neuron representing an internal state, uses multiplicative units as gates to control the flow of information. A typical LSTM cell has an input gate that controls the input of information from the outside, a forget gate that controls whether to keep or forget the information in the internal state, and an output gate that allows or prevents the internal state from being seen from the outside.

In this work, we used LSTM units to build a model that receives a URL as a character sequence and predicts whether or not the URL corresponds to a case of phishing. The architecture is illustrated in Fig. 5. Each input character is translated by a 128-dimension embedding. The translated URL is fed into an LSTM layer as a 150-step sequence. Finally, the classification is performed using an output sigmoid neuron. The network is trained by backpropagation using a cross-entropy loss function and dropout in the last layer.

Figure 5. Recurrent neural network for classifying phishing URLs based on LSTM units.

RESULT

The phishing website detection model has been trained and tested using several classifiers and ensemble algorithms in order to analyze and compare their results and find the best accuracy. Each algorithm reports its evaluated accuracy, and once all the algorithms have returned their results, each is compared with the others to see which provides the highest accuracy percentage, as shown in Table 1. Each algorithm's accuracy is also depicted in a confusion matrix for greater comprehension. The dataset is also used to train a deep learning algorithm. The final accuracy comparison of the algorithms is shown in Figure 6.

Table 1. Comparison table

Classifiers            Training set accuracy (%)   Testing set accuracy (%)   Precision (%)
Logistic Regression    92.00                       92.00                      89.00
XGBoost                93.80                       93.40                      93.42

Figure 6. Comparison of ML algorithms.

CONCLUSION

This paper aims to enhance the detection of phishing websites using machine learning technology. We achieved 97.14% detection accuracy using the random forest algorithm, with the lowest false positive rate. The results also show that classifiers give better performance when more data is used as training data. In future, a hybrid technology will be implemented to detect phishing websites more accurately, for which the random forest algorithm of machine learning technology and the blacklist method will be used.

ACKNOWLEDGMENT

Special thanks to our supervisor, Professor Mr. Mallesh Hatti, for his guidance.

REFERENCES

[1] AO Kaspersky Lab. (2017). The Dangers of Phishing: Help employees avoid the lure of cybercrime. [Online]. Available: https://go.kaspersky.com/Dangers-PhishingLanding-Page-Soc.html [Oct 30, 2017].
[2] “Financial threats in 2016: Every Second Phishing Attack Aims to Steal Your Money,” 2017, financial-threats-in-2016. Feb 22, 2017 [Oct 30, 2017].
[3] Y. Zhang, J. I. Hong, and L. F. Cranor, “Cantina: A Content-based Approach to Detecting Phishing Web Sites,” New York, NY, USA, 2007, pp. 639-648.
[4] M. Blasi, “Techniques for detecting zero day phishing websites,” M.A. thesis, Iowa State University, USA, 2009.
[5] R. S. Rao and S. T. Ali, “PhishShield: A Desktop Application to Detect Phishing Webpages through Heuristic Approach,” Procedia Computer Science, vol. 54, no. Supplement C, pp. 147-156, 2015.
[6] E. Jakobsson and E. Myers, Phishing and Countermeasures: Understanding the Increasing Problem of Electronic Identity Theft. Wiley, 2006, pp. 2-3.
[7] L. A. T. Nguyen, B. L. To, H. K. Nguyen, and M. H. Nguyen, “Detecting phishing web sites: A heuristic URL-based approach,” in 2013 International Conference on Advanced Technologies for Communications (ATC 2013), 2013, pp. 597-602.
[8] Z. Zhang, Q. He, and B. Wang, “A Novel Multi-Layer Heuristic Model for Anti-Phishing,” New York, NY, USA, 2017, p. 21:1-21:6.
[9] N. Sanglerdsinlapachai and A. Rungsawang, “Web Phishing Detection Using Classifier Ensemble,” New York, NY, USA, 2010, pp. 210-215.
[10] G. Xiang, J. Hong, C. P. Rose, and L. Cranor, “CANTINA+: A Feature-Rich Machine Learning Framework for Detecting Phishing Web Sites,” ACM Trans. Inf. Syst. Secur., vol. 14, no. 2, pp. 21:1-21:28, Sep. 2011.
[11] R. M. Mohammad, F. Thabtah, and L. McCluskey, “Predicting phishing websites based on self-structuring neural network,” Neural Comput & Applic, vol. 25, no. 2, pp. 443-458, Aug. 2014.
[12] Pradeepthi K V and Kannan A, “Performance study of classification techniques for phishing URL detection,” in 2014 Sixth International Conference on Advanced Computing (ICoAC), 2014, pp. 135-139.
[13] S. Marchal, J. François, R. State, and T. Engel, “PhishStorm: Detecting Phishing With Streaming Analytics,” IEEE Transactions on Network and Service Management, vol. 11, no. 4, pp. 458-471, Dec. 2014.
[14] A. Sirageldin, B. B. Baharudin, and L. T. Jung, “Malicious Web Page Detection: A Machine Learning Approach,” in Advances in Computer Science and its Applications, Springer, Berlin, Heidelberg, 2014, pp. 217-224.
[15] R. Verma and K. Dyer, “On the Character of Phishing URLs: Accurate and Robust Statistical Learning Classifiers,” New York, NY, USA, 2015, pp. 111-122.
[16] H. H. Nguyen and D. T. Nguyen, “Machine Learning Based Phishing Web Sites Detection,” in AETA 2015: Recent Advances in Electrical Engineering and Related Sciences, V. H. Duy, T. T. Dao, I. Zelinka, H.-S. Choi, and M. Chadli, Eds. Cham: Springer International Publishing, 2016, pp. 123-131.
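As an illustration of the lexical decomposition described in Section IV, the sketch below splits the example phishing URL into the parts listed in Figure 4, with the hyphens written as they appear in that figure. The helper names `lexical_parts` and `brand_in_subdomain` are hypothetical, and treating the last two host labels as the registered domain is a simplifying assumption that does not hold for multi-part suffixes such as "co.uk".

```python
from urllib.parse import urlsplit


def lexical_parts(url):
    """Split a URL into protocol, subdomain labels, domain name and path."""
    parts = urlsplit(url)
    labels = parts.hostname.split(".")
    # Assumption: the registered domain is the last two labels (SLD + TLD).
    return {
        "protocol": parts.scheme,
        "subdomains": labels[:-2],
        "domain": ".".join(labels[-2:]),
        "path": parts.path,
    }


def brand_in_subdomain(url, brand_domain):
    """True when a trusted name appears in the host but is not the registered domain."""
    parts = lexical_parts(url)
    host = ".".join(parts["subdomains"] + [parts["domain"]])
    return brand_domain in host and parts["domain"] != brand_domain


url = "http://amazon.com-verification-accounts.darotob.com/Sign-in/5b60fcc60b36d1c3d"
print(lexical_parts(url))
print(brand_in_subdomain(url, "amazon.com"))
```

The second helper captures the obfuscation trick discussed in Section IV: the trusted name sits in the attacker-controlled subdomain while the registered domain is a different one.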
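Section VI states that each URL is fed to the LSTM layer as a 150-step character sequence, but does not say how characters are indexed or how shorter URLs are padded. The following sketch shows one plausible input preparation. The vocabulary (printable ASCII), the use of index 0 for both padding and unknown characters, the left padding and the name `encode_url` are assumptions, not details taken from the paper.

```python
import string

MAX_LEN = 150  # sequence length stated in Section VI

# Assumption: the vocabulary is printable ASCII; index 0 is reserved
# for padding and for characters outside the vocabulary.
VOCAB = {ch: index + 1 for index, ch in enumerate(string.printable)}


def encode_url(url, max_len=MAX_LEN):
    """Map a URL to a fixed-length list of integer character indices."""
    ids = [VOCAB.get(ch, 0) for ch in url[:max_len]]  # truncate long URLs
    return [0] * (max_len - len(ids)) + ids           # left-pad short URLs


sequence = encode_url("http://a.com")
print(len(sequence), sequence[-12:])
```

Each integer in the resulting sequence would then be looked up in the 128-dimension embedding before reaching the LSTM layer.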