research-article

Detecting visually similar Web pages: Application to phishing detection

Authors:

Teh-Chung Chen,

James Miller

ACM Transactions on Internet Technology (TOIT), Volume 10, Issue 2

Article No.: 5, Pages 1 - 38

https://doi.org/10.1145/1754393.1754394

Published: 10 June 2010

Abstract

We propose a novel approach for detecting visual similarity between two Web pages. The proposed approach applies Gestalt theory and considers a Web page as a single indivisible entity. The concept of supersignals, as a realization of Gestalt principles, supports our contention that Web pages must be treated as indivisible entities. We objectify, and directly compare, these indivisible supersignals using algorithmic complexity theory. We illustrate our approach by applying it to the problem of detecting phishing scams. Via a large-scale, real-world case study, we demonstrate that 1) our approach effectively detects similar Web pages; and 2) it accurately distinguishes legitimate and phishing pages.
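
As a rough illustration of comparing two pages as whole objects with algorithmic complexity theory, the sketch below computes a normalized compression distance (NCD), a standard computable approximation of information distance. The abstract does not state which metric or compressor the authors use, so the choice of NCD, the use of zlib, and the byte strings standing in for rendered pages are all assumptions made for this example.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Compressed length in bytes; a computable stand-in for Kolmogorov complexity."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: lower means more similar."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)  # complexity of the two objects taken together
    return (cxy - min(cx, cy)) / max(cx, cy)


# Hypothetical stand-ins for page renderings (assumed data, not from the paper).
page_a = b"<html><body><h1>Sign in</h1><form>user pass</form></body></html>" * 20
page_b = page_a.replace(b"Sign in", b"Log in")  # near-copy of page_a
unrelated = bytes(range(256)) * 5               # shares little structure with page_a

print("near-copy:", round(ncd(page_a, page_b), 3))
print("unrelated:", round(ncd(page_a, unrelated), 3))
```

Each input is compressed as one undivided byte string, which loosely mirrors the idea of treating a page as a single entity rather than a set of extracted features. A detector built on such a distance would still need a decision threshold, which this sketch does not attempt to choose.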

Index Terms

Detecting visually similar Web pages: Application to phishing detection
1. Human-centered computing
  1. Collaborative and social computing
    1. Collaborative and social computing systems and tools
2. Information systems
  1. World Wide Web

Published In

ACM Transactions on Internet Technology Volume 10, Issue 2

May 2010

123 pages

ISSN:1533-5399

EISSN:1557-6051

DOI:10.1145/1754393

Copyright © 2010 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 10 June 2010

Accepted: 01 December 2009

Revised: 01 June 2009

Received: 01 January 2009

Published in TOIT Volume 10, Issue 2

Qualifiers

Research-article
Research
Refereed
