Detecting visually similar Web pages: Application to phishing detection

Published: 10 June 2010

Abstract

We propose a novel approach for detecting visual similarity between two Web pages. The proposed approach applies Gestalt theory and considers a Web page as a single indivisible entity. The concept of supersignals, as a realization of Gestalt principles, supports our contention that Web pages must be treated as indivisible entities. We objectify, and directly compare, these indivisible supersignals using algorithmic complexity theory. We illustrate our approach by applying it to the problem of detecting phishing scams. Via a large-scale, real-world case study, we demonstrate that (1) our approach effectively detects similar Web pages and (2) it accurately distinguishes legitimate and phishing pages.
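
The abstract does not say which measure from algorithmic complexity theory is used, or which byte representation of a page is compared. As a rough illustration only, the sketch below assumes the normalized compression distance (NCD), a common computable stand-in for Kolmogorov-complexity-based similarity, with `zlib` as the compressor and raw bytes as the page representation. The function names `compressed_size` and `ncd` are hypothetical and are not taken from the paper.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length of the zlib-compressed input, used as a complexity estimate."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings.

    Values near 0 suggest the inputs share most of their structure;
    values near 1 suggest they share little. The compressor (zlib) is
    an assumption made for this sketch.
    """
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)  # each whole input is compressed as one unit
    return (cxy - min(cx, cy)) / max(cx, cy)
```

Each input is compressed as a whole, with no feature extraction or segmentation, which is in keeping with the abstract's view of a page as one indivisible entity. In a phishing setting, one might flag a suspect page whose distance to a known legitimate page falls below some threshold; any such threshold would have to be chosen empirically.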


Published In

ACM Transactions on Internet Technology, Volume 10, Issue 2
May 2010
123 pages
ISSN:1533-5399
EISSN:1557-6051
DOI:10.1145/1754393
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 10 June 2010
Accepted: 01 December 2009
Revised: 01 June 2009
Received: 01 January 2009
Published in TOIT Volume 10, Issue 2

Author Tags

  1. Algorithmic complexity theory
  2. Gestalt theory
  3. Web page similarity
  4. anti-phishing technologies

Qualifiers

  • Research-article
  • Research
  • Refereed

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 46
  • Downloads (Last 6 weeks): 4
Reflects downloads up to 11 Feb 2025

Citations

Cited By

  • (2024) Malicious Website Prediction Using Machine Learning Methodologies. REST Journal on Data Analytics and Artificial Intelligence 3:3 (118-123). DOI: 10.46632/jdaai/3/3/15. Online publication date: 6-Sep-2024
  • (2024) PhiSN: Phishing URL Detection Using Segmentation and NLP Features. Journal of Information Processing 32 (973-989). DOI: 10.2197/ipsjjip.32.973. Online publication date: 2024
  • (2024) VORTEX : Visual phishing detectiOns aRe Through EXplanations. ACM Transactions on Internet Technology 24:2 (1-24). DOI: 10.1145/3654665. Online publication date: 6-May-2024
  • (2024) Phishing Website Detection Using Hybrid Machine Learning Model. 2024 International Conference on Computer, Electronics, Electrical Engineering & their Applications (IC2E3) (1-6). DOI: 10.1109/IC2E362166.2024.10826907. Online publication date: 6-Jun-2024
  • (2024) Phishing webpage detection based on global and local visual similarity. Expert Systems with Applications 252 (124120). DOI: 10.1016/j.eswa.2024.124120. Online publication date: Oct-2024
  • (2023) E-mail Spam Detection and Phishing link Detection Using Machine Learning. Advances in Computational Intelligence in Materials Science (47-53). DOI: 10.53759/acims/978-9914-9946-9-8_9. Online publication date: 7-Jun-2023
  • (2023) A Good Fishman Knows All the Angles: A Critical Evaluation of Google's Phishing Page Classifier. Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security (2486-2500). DOI: 10.1145/3576915.3623199. Online publication date: 15-Nov-2023
  • (2023) Phishing or Not Phishing? A Survey on the Detection of Phishing Websites. IEEE Access 11 (18499-18519). DOI: 10.1109/ACCESS.2023.3247135. Online publication date: 2023
  • (2023) Methods for Automatic Web Page Layout Testing and Analysis: A Review. IEEE Access 11 (13948-13964). DOI: 10.1109/ACCESS.2023.3242549. Online publication date: 2023
  • (2023) Explainable machine learning for phishing feature detection. Quality and Reliability Engineering International 40:1 (362-373). DOI: 10.1002/qre.3411. Online publication date: 17-Jul-2023
