
Automated gathering of Web information: An in-depth examination of agents interacting with search engines

Published: 01 November 2006


The Web has become a worldwide repository of information that individuals, companies, and organizations use to address a variety of information problems. Many of these Web users employ automated agents to gather this information for them, and some assume that this approach is a more sophisticated method of searching. However, there is little research investigating how Web agents search for online information. In this research, we first provide a classification of information agents based on stages of information gathering, gathering approaches, and agent architecture. We then examine an implementation of one of the resulting classifications in detail, investigating how agents search for information on Web search engines, including the session, query, term, duration, and frequency of interactions. For this temporal study, we analyzed three data sets of queries and page views from agents interacting with the Excite and AltaVista search engines from 1997 to 2002, examining approximately 900,000 queries submitted by over 3,000 agents. Findings include: (1) agent sessions are extremely interactive, sometimes with hundreds of interactions per second; (2) agent queries are comparable to those of human searchers, with little use of query operators; (3) Web agents search for a relatively limited variety of information, with only 18% of the terms used being unique; and (4) the duration of agent-Web search engine interaction typically spans several hours. We discuss the implications for Web information agents and search engines.
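The session-, term-, and duration-level measures named in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the record layout `(agent_id, timestamp, query)`, the sample `LOG`, the function name `summarize`, and the definition of term uniqueness as distinct terms divided by total term occurrences. The paper's actual data sets and definitions may differ.

```python
from collections import Counter, defaultdict

# Hypothetical log records: (agent_id, unix_timestamp_seconds, query_text).
# These rows are invented for illustration and are not from the study's data.
LOG = [
    ("a1", 100, "cheap flights"),
    ("a1", 100, "cheap hotels"),
    ("a1", 101, "cheap flights"),
    ("a2", 200, "weather"),
    ("a2", 7400, "weather report"),
]


def summarize(log):
    """Return per-agent session statistics and the share of distinct terms."""
    # Group events by agent; one agent is treated as one session (an assumption).
    sessions = defaultdict(list)
    for agent_id, ts, query in log:
        sessions[agent_id].append((ts, query))

    terms = Counter()
    per_agent = {}
    for agent_id, events in sessions.items():
        times = [ts for ts, _ in events]
        per_second = Counter(times)  # interactions observed in each second
        for _, query in events:
            # Naive whitespace tokenization, lowercased.
            terms.update(query.lower().split())
        per_agent[agent_id] = {
            "queries": len(events),
            "duration_s": max(times) - min(times),
            "peak_per_second": max(per_second.values()),
        }

    total_terms = sum(terms.values())
    # Assumed definition: distinct terms over all term occurrences.
    unique_share = len(terms) / total_terms if total_terms else 0.0
    return per_agent, unique_share
```

Calling `summarize(LOG)` returns a dictionary keyed by agent and a single ratio for the whole log. On a real log, the peak interactions per second and the session duration correspond to findings (1) and (4), and the ratio corresponds to finding (3), under the assumptions stated above.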





      Published In

      ACM Transactions on Internet Technology  Volume 6, Issue 4
      November 2006
      197 pages


      Association for Computing Machinery

      New York, NY, United States




      Author Tags

      1. Search engines
      2. Web searching
      3. agent searching

