Discovering tasks from search engine query logs

Authors: Claudio Lucchese, Salvatore Orlando, Raffaele Perego, Fabrizio Silvestri, Gabriele Tolomei

ACM Transactions on Information Systems (TOIS), Volume 31, Issue 3

Article No.: 14, Pages 1 - 43

https://doi.org/10.1145/2493175.2493179

Published: 05 August 2013

Abstract

Although Web search engines still answer user queries with lists of ten blue links to webpages, people are increasingly issuing queries to accomplish their daily tasks (e.g., finding a recipe, booking a flight, reading online news). In this work, we propose a two-step methodology for discovering the tasks that users try to perform through search engines. First, we identify user tasks from individual user sessions stored in search engine query logs. In our view, a user task is a set of possibly noncontiguous queries (within a user search session) that refer to the same need. Second, we discover collective tasks by aggregating similar user tasks, possibly performed by distinct users. To discover user tasks, we propose query similarity functions based on unsupervised and supervised learning approaches. We present a set of query clustering methods that exploit these functions to detect user tasks. All the proposed solutions were evaluated on a manually built ground truth, and two of them performed better than state-of-the-art approaches. To detect collective tasks, we propose four methods that cluster previously discovered user tasks, each of which is represented by the bag of words extracted from its component queries. These solutions were also evaluated on another manually built ground truth.
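The two steps described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not name the similarity functions or the clustering algorithms, so the character-trigram Jaccard similarity, the `0.3` threshold, the connected-components grouping, and every function name below are assumptions chosen only to make the idea concrete.

```python
from itertools import combinations


def trigrams(query):
    """Character trigrams of a lowercased, space-padded query (assumed feature)."""
    padded = f"  {query.lower().strip()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def bag_of_words(task):
    """Set of words appearing in any query of a task."""
    return {word for query in task for word in query.lower().split()}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def link_components(items, features, threshold):
    """Group items whose features are transitively similar.

    Items are nodes, and an edge joins two items whose Jaccard similarity
    is at least `threshold`. Each connected component becomes one group.
    """
    parent = list(range(len(items)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(items)), 2):
        if jaccard(features[i], features[j]) >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i, item in enumerate(items):
        groups.setdefault(find(i), []).append(item)
    return list(groups.values())


def user_tasks(session, threshold=0.3):
    """Step 1: split one user's session into tasks (possibly noncontiguous)."""
    return link_components(session, [trigrams(q) for q in session], threshold)


def collective_tasks(tasks, threshold=0.3):
    """Step 2: aggregate similar user tasks, compared by their bags of words."""
    return link_components(tasks, [bag_of_words(t) for t in tasks], threshold)


if __name__ == "__main__":
    # Hypothetical session that interleaves two needs.
    session = [
        "cheap flights rome",
        "lasagna recipe",
        "flights to rome",
        "easy lasagna recipe",
    ]
    for task in user_tasks(session):
        print(task)
```

In this sketch the flight queries and the recipe queries end up in separate tasks even though they alternate in the session, which is the "possibly noncontiguous" property the abstract emphasizes. Linking by connected components is transitive, so a low threshold can chain loosely related queries into one task; the threshold here is arbitrary and would need tuning against a labeled ground truth.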

Index Terms

Information systems

Published In

ACM Transactions on Information Systems Volume 31, Issue 3

July 2013

202 pages

ISSN:1046-8188

EISSN:1558-2868

DOI:10.1145/2493175

Copyright © 2013 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 05 August 2013

Accepted: 01 March 2013

Revised: 01 June 2012

Received: 01 May 2011

Published in TOIS Volume 31, Issue 3

Qualifiers

Research-article
Research
Refereed
