DOI: 10.1145/1507509.1507512
research-article

Distinguishing humans from robots in web search logs: preliminary results using query rates and intervals

Published: 09 February 2009

Abstract

The workload on web search engines is actually multiclass, being derived from the activities of both human users and automated robots. It is important to distinguish between these two classes in order to reliably characterize human web search behavior, and to study the effect of robot activity. We suggest an approach based on a multi-dimensional characterization of search sessions, and take first steps towards implementing it by studying the interaction between the query submittal rate and the minimal interval of time between different queries.
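The two quantities named in the abstract can be sketched in a few lines. This is a minimal illustration, assuming a log of `(user_id, timestamp_seconds)` records; that layout, the function name `session_features`, and the sample data are hypothetical and not taken from the paper. The sketch treats every consecutive pair of submissions as a candidate interval and does not compare query text.

```python
from collections import defaultdict


def session_features(log):
    """Return {user_id: (num_queries, min_interval_seconds)}.

    `log` is an iterable of (user_id, timestamp_seconds) pairs. This record
    layout is an assumption for illustration, not the format of any real log.
    min_interval_seconds is None for users who submitted a single query.
    """
    times = defaultdict(list)
    for user_id, ts in log:
        times[user_id].append(ts)

    features = {}
    for user_id, stamps in times.items():
        stamps.sort()
        # Gaps between consecutive submissions by the same user.
        gaps = [later - earlier for earlier, later in zip(stamps, stamps[1:])]
        features[user_id] = (len(stamps), min(gaps) if gaps else None)
    return features


# Made-up sample: one slow user and one rapid-fire client.
sample_log = [
    ("u1", 0), ("u1", 45), ("u1", 130),
    ("bot", 0), ("bot", 1), ("bot", 2), ("bot", 3), ("bot", 4),
]
features = session_features(sample_log)
```

Dividing the query count by the length of the observation window would turn it into a submittal rate; the window length depends on the log being analyzed.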

Reviews

Donald Harris Kraft

This paper raises an interesting issue about distinguishing between humans and robots performing a search, when processing Web search logs. This may not seem like an important issue, but it is necessary to ensure proper evaluation of these logs. Moreover, one can also use the methodology employed in this paper to assess the impact of robot searches on the search engines.

Duskin and Feitelson note that past work generally considered that humans do not employ Boolean operators as extensively as software agents, such as those involved with meta-search engines. In their analysis, the authors use a variety of measures, including the number of queries submitted (average rate), the minimal interval between successive queries, the rate at which queries are typed, the duration of sessions of continuous activity, the time of day of the queries, and the regularity of submitted queries (whether the same query was submitted very often or if the queries were submitted at regular intervals).

The authors use three logs (AlltheWeb, AltaVista, and MSN) as their data sources, over a few days. They find that classification by number of queries and by minimal interval between queries is quite useful. Moreover, they test the thresholds by which classification (human or robot) is made, to find the best thresholds for the classification decision.

Finally, Duskin and Feitelson note that this is only a first step. More testing is needed to find reliable classification methods, since they feel that one simple threshold is not accurate enough. Readers, especially those involved with Web search log analysis, should consider this most interesting paper.

Online Computing Reviews Service
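A threshold-based decision on the two measures the review singles out (number of queries and minimal interval between queries) might look like the sketch below. The function name `classify`, both threshold values, and the rule that the two measures must agree are assumptions chosen for illustration; they are not the thresholds or the decision rule reported by the authors.

```python
def classify(num_queries, min_interval,
             max_human_queries=50, min_human_interval=2.0):
    """Label a user 'robot', 'human', or 'unclassified'.

    Both thresholds are placeholder values (assumptions), not figures from
    the paper. min_interval is in seconds, or None if the user submitted
    only one query.
    """
    too_many = num_queries > max_human_queries
    too_fast = min_interval is not None and min_interval < min_human_interval

    if too_many and too_fast:
        return "robot"          # both measures point to automation
    if not too_many and not too_fast:
        return "human"          # both measures look like ordinary use
    return "unclassified"       # the measures disagree
```

Leaving an "unclassified" outcome for users on whom the two measures disagree reflects the review's remark that one simple threshold is not accurate enough; how such users should be handled is left open here.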

Published In

WSCD '09: Proceedings of the 2009 workshop on Web Search Click Data
February 2009
95 pages
ISBN: 9781605584348
DOI: 10.1145/1507509
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Article Metrics

  • Downloads (Last 12 months): 6
  • Downloads (Last 6 weeks): 0

Reflects downloads up to 09 Nov 2024.

Cited By

  • (2023) The Problem and Its Key Characteristics. Analysing Web Traffic, pp. 1-14. DOI: 10.1007/978-3-031-32503-8_1. Online publication date: 27-Jun-2023
  • (2021) Exploiting the Community Structure of Fraudulent Keywords for Fraud Detection in Web Search. Journal of Computer Science and Technology 36:5, pp. 1167-1183. DOI: 10.1007/s11390-021-0218-2. Online publication date: 30-Sep-2021
  • (2019) Internet Memes: A Novel Approach to Distinguish Humans and Bots for Authentication. Proceedings of the Future Technologies Conference (FTC) 2019, pp. 204-222. DOI: 10.1007/978-3-030-32520-6_16. Online publication date: 13-Oct-2019
  • (2018) Improved expert selection model for forex trading. Frontiers of Computer Science: Selected Publications from Chinese Universities 12:3, pp. 518-527. DOI: 10.1007/s11704-017-6472-3. Online publication date: 1-Jun-2018
  • (2017) A Hybrid Abnormal Advertising Traffic Detection Method. 2017 IEEE International Conference on Big Knowledge (ICBK), pp. 236-241. DOI: 10.1109/ICBK.2017.50. Online publication date: Aug-2017
  • (2016) An integrated method for real time and offline web robot detection. Expert Systems: The Journal of Knowledge Engineering 33:6, pp. 592-606. DOI: 10.1111/exsy.12184. Online publication date: 1-Dec-2016
  • (2016) Online Prediction for Forex with an Optimized Experts Selection Model. Web Technologies and Applications, pp. 371-382. DOI: 10.1007/978-3-319-45814-4_30. Online publication date: 17-Sep-2016
  • (2015) Query or spam: Detecting fraudulent web requests using stream clustering. 2015 2nd International Conference on Knowledge-Based Engineering and Innovation (KBEI), pp. 853-859. DOI: 10.1109/KBEI.2015.7436155. Online publication date: Nov-2015
  • (2014) Identifying user sessions from web server logs with integer programming. Intelligent Data Analysis 18:1, pp. 43-61. DOI: 10.5555/2595613.2595617. Online publication date: 1-Jan-2014
  • (2014) Aleph or Aleph-Maddah, that is the question! Spelling correction for search engine autocomplete service. 2014 4th International Conference on Computer and Knowledge Engineering (ICCKE), pp. 273-278. DOI: 10.1109/ICCKE.2014.6993359. Online publication date: Oct-2014
