research-article

Open access

Combining Text and Visual Features to Improve the Identification of Cloned Webpages for Early Phishing Detection

Authors:

Bram van Dooremaal,

Nicola Zannone

ARES '21: Proceedings of the 16th International Conference on Availability, Reliability and Security

Article No.: 60, Pages 1 - 10

https://doi.org/10.1145/3465481.3470112

Published: 17 August 2021

Abstract

Phishing attacks arrive in high numbers and often spread quickly, so after-the-fact countermeasures such as domain blacklisting have limited efficacy. Visual similarity-based approaches have the potential to detect previously unseen phishing webpages. These approaches, however, require identifying the legitimate webpage(s) that a phishing page reproduces. Existing approaches rely on textual feature analysis for target identification, with misclassification rates of approximately 1%; however, as most websites a user might visit are legitimate, additional research is needed to further reduce classification errors. In this work, we propose a novel method for target identification that relies on both visual features (extracted from a screenshot of the webpage) and textual features (extracted from the DOM of the webpage) to identify which website a phishing webpage is replicating, and we assess its effectiveness in detecting phishing websites using data from phishing aggregators such as OpenPhish, PhishTank and PhishStats. Compared to state-of-the-art text-based classifiers, our method reduces the phishing misclassification rate by 67% (from 1.02% to 0.34%), for an accuracy of 99.66%. This work provides a further step toward semi-automated decision support systems for phishing detection.
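The reported figures can be checked with a few lines of arithmetic. The sketch below is not from the paper; it assumes that the 67% figure is a relative reduction of the misclassification rate and that accuracy is simply 100% minus that rate. The function and variable names are illustrative only.

```python
def relative_reduction(before: float, after: float) -> float:
    """Relative reduction, in percent, when a rate falls from `before` to `after`."""
    return (before - after) / before * 100.0


# Misclassification rates quoted in the abstract, in percent.
baseline_error = 1.02  # state-of-the-art text-based classifiers
combined_error = 0.34  # combined text and visual features

reduction = relative_reduction(baseline_error, combined_error)
accuracy = 100.0 - combined_error  # assumes accuracy = 100% - misclassification rate

print(f"relative reduction: {reduction:.1f}%")
print(f"accuracy: {accuracy:.2f}%")
```

Under these assumptions the relative reduction is about 66.7%, which rounds to the quoted 67%, and the accuracy matches the quoted 99.66%.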

Published In

ARES '21: Proceedings of the 16th International Conference on Availability, Reliability and Security

August 2021

1447 pages

ISBN:9781450390514

DOI:10.1145/3465481

Copyright © 2021 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Funding Sources

ITEA3

Conference

ARES 2021

ARES 2021: The 16th International Conference on Availability, Reliability and Security

August 17 - 20, 2021

Vienna, Austria

Acceptance Rates

Overall Acceptance Rate 228 of 451 submissions, 51%
