DOI: 10.1145/2492517.2500239

A comparison of web robot and human requests

Published: 25 August 2013

Abstract

Sophisticated Web robots sport a wide variety of functionality and visiting characteristics, and account for a significant percentage of the requests serviced by a Web server. Unlike human clients, who retrieve information from a site by navigating links and ignoring irrelevant information, Web robots may collect many different types of resources and employ varying navigation strategies to find the information they seek on the site. Thus, the resource request patterns of their visits are unpredictable and cannot be inferred from our knowledge of human request patterns. In this paper, we analyze the types of resources requested by Web robots using recent Web logs from an academic Web server. We study the distribution of response sizes and response codes, the types of resources requested, and the popularity of resources for requests from Web robots. Throughout, we contrast our findings against human resource request patterns. We find reasons to suggest that robots severely handicap the ability of Web server caches to operate with high performance.
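
As a rough illustration of the kind of tally the abstract describes (and not the authors' actual pipeline), the sketch below parses Combined Log Format lines and counts response codes, resource types, and response bytes separately for robot and human requests. The sample log lines, the `ROBOT_HINTS` keyword list, and the user-agent heuristic in `classify` are assumptions made purely for illustration; the abstract does not say how robots were identified.

```python
import re
from collections import Counter, defaultdict
from os.path import splitext
from urllib.parse import urlsplit

# Combined Log Format:
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_RE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"[^"]*" "(?P<agent>[^"]*)"$'
)

# Hypothetical keyword heuristic (an assumption); real robot detection is harder.
ROBOT_HINTS = ("bot", "crawler", "spider")


def classify(agent):
    """Label a request 'robot' or 'human' from its user-agent string."""
    lowered = agent.lower()
    return "robot" if any(hint in lowered for hint in ROBOT_HINTS) else "human"


def summarize(lines):
    """Tally status codes, resource extensions, and response bytes per class."""
    stats = defaultdict(lambda: {"status": Counter(), "ext": Counter(), "bytes": 0})
    for line in lines:
        m = LOG_RE.match(line.strip())
        if m is None:
            continue  # skip lines that do not match the expected format
        who = classify(m["agent"])
        # Use the file extension of the URL path as a crude resource type.
        ext = splitext(urlsplit(m["path"]).path)[1].lower() or "(none)"
        stats[who]["status"][m["status"]] += 1
        stats[who]["ext"][ext] += 1
        if m["size"] != "-":  # "-" means no response body was sent
            stats[who]["bytes"] += int(m["size"])
    return dict(stats)


# Made-up log lines for illustration only.
SAMPLE = [
    '1.2.3.4 - - [25/Aug/2013:10:00:00 -0400] "GET /index.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (Windows NT 6.1)"',
    '5.6.7.8 - - [25/Aug/2013:10:00:01 -0400] "GET /papers/a.pdf HTTP/1.1" 200 90210 "-" "ExampleBot/1.0 (+http://example.com/bot)"',
    '5.6.7.8 - - [25/Aug/2013:10:00:02 -0400] "GET /old/page.html HTTP/1.1" 404 - "-" "ExampleBot/1.0 (+http://example.com/bot)"',
]

if __name__ == "__main__":
    for who, s in summarize(SAMPLE).items():
        print(who, dict(s["status"]), dict(s["ext"]), s["bytes"])
```

Comparing the per-class counters side by side gives a first impression of how robot and human request mixes differ; a study like the one summarized above would need far more careful robot identification than a user-agent keyword match.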

Published In

ASONAM '13: Proceedings of the 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining
August 2013
1558 pages
ISBN: 9781450322409
DOI: 10.1145/2492517
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Research-article

Conference

ASONAM '13
ASONAM '13: Advances in Social Networks Analysis and Mining 2013
August 25 - 28, 2013
Niagara, Ontario, Canada

Acceptance Rates

Overall Acceptance Rate 116 of 549 submissions, 21%

Article Metrics

  • Downloads (Last 12 months): 9
  • Downloads (Last 6 weeks): 1
Reflects downloads up to 22 Sep 2024

Cited By

  • (2023) Reinforcement learning based web crawler detection for diversity and dynamics. Neurocomputing, 520, pp. 115-128. DOI: 10.1016/j.neucom.2022.11.059. Online publication date: Feb-2023
  • (2023) FRS-SIFS: fuzzy rough set session identification and feature selection in web robot detection. International Journal of Machine Learning and Cybernetics, 15:2, pp. 237-252. DOI: 10.1007/s13042-023-01905-7. Online publication date: 10-Jul-2023
  • (2022) Semi-Supervised Self-Training Approach for Web Robots Activity Detection in Weblog. Evolutionary Computing and Mobile Sustainable Networks, pp. 911-924. DOI: 10.1007/978-981-16-9605-3_64. Online publication date: 22-Mar-2022
  • (2020) Content-aware web robot detection. Applied Intelligence. DOI: 10.1007/s10489-020-01754-9. Online publication date: 7-Jul-2020
  • (2019) PathMarker: protecting web contents against inside crawlers. Cybersecurity, 2:1. DOI: 10.1186/s42400-019-0023-1. Online publication date: 20-Feb-2019
  • (2019) CARONTE: Crawling Adversarial Resources Over Non-Trusted, High-Profile Environments. 2019 IEEE European Symposium on Security and Privacy Workshops (EuroS&PW), pp. 433-442. DOI: 10.1109/EuroSPW.2019.00055. Online publication date: Jun-2019
  • (2019) A Hybrid Approach for Recognizing Web Crawlers. Wireless Algorithms, Systems, and Applications, pp. 507-519. DOI: 10.1007/978-3-030-23597-0_41. Online publication date: 21-Jun-2019
  • (2018) Web Access Patterns of Actual Human Visitors and Web Robots. Handbook of Research on Pattern Engineering System Development for Big Data Analytics, pp. 193-215. DOI: 10.4018/978-1-5225-3870-7.ch012. Online publication date: 2018
  • (2018) Contrasting Web Robot and Human Behaviors with Network Models. Journal of Communications, pp. 473-481. DOI: 10.12720/jcm.13.8.473-481. Online publication date: 2018
  • (2018) Web Robot Detection: A Semantic Approach. 2018 IEEE 30th International Conference on Tools with Artificial Intelligence (ICTAI), pp. 968-974. DOI: 10.1109/ICTAI.2018.00150. Online publication date: Nov-2018
