research-article

Discovering tasks from search engine query logs

Published: 05 August 2013

Abstract

Although Web search engines still answer user queries with lists of ten blue links to webpages, people increasingly issue queries to accomplish daily tasks (e.g., finding a recipe, booking a flight, or reading online news). In this work, we propose a two-step methodology for discovering the tasks that users try to perform through search engines. First, we identify user tasks from individual user sessions stored in search engine query logs. In our view, a user task is a set of possibly noncontiguous queries (within a user search session) that refer to the same need. Second, we discover collective tasks by aggregating similar user tasks, possibly performed by distinct users. To discover user tasks, we propose query similarity functions based on unsupervised and supervised learning approaches. We present a set of query clustering methods that exploit these functions to detect user tasks. All the proposed solutions were evaluated on a manually built ground truth, and two of them performed better than state-of-the-art approaches. To detect collective tasks, we propose four methods that cluster previously discovered user tasks, each represented by the bag-of-words extracted from its constituent queries. These solutions were also evaluated on another manually built ground truth.
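
The two-step idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the character-trigram Jaccard similarity, the connected-components grouping, the 0.3 thresholds, and all function names below are assumptions chosen only to keep the example short. Step one groups the (possibly noncontiguous) queries of one session into user tasks; step two represents each user task as a bag-of-words and groups similar tasks into collective tasks.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def trigrams(text):
    """Character 3-grams of a lowercased, padded query (hypothetical feature)."""
    padded = f"  {text.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def query_similarity(q1, q2):
    """Jaccard overlap of trigram sets; a stand-in, not the paper's functions."""
    a, b = trigrams(q1), trigrams(q2)
    return len(a & b) / len(a | b)


def bag_similarity(bag1, bag2):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(bag1[w] * bag2[w] for w in bag1.keys() & bag2.keys())
    norm1 = sqrt(sum(v * v for v in bag1.values()))
    norm2 = sqrt(sum(v * v for v in bag2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0


def connected_groups(items, similarity, threshold):
    """Return index groups: connected components of the graph whose edges
    link items with similarity >= threshold (computed with union-find)."""
    parent = list(range(len(items)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(items)), 2):
        if similarity(items[i], items[j]) >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(items)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


def user_tasks(session, threshold=0.3):
    """Step 1: split one user's session (a list of queries) into tasks."""
    groups = connected_groups(session, query_similarity, threshold)
    return [[session[i] for i in group] for group in groups]


def collective_tasks(tasks, threshold=0.3):
    """Step 2: aggregate user tasks whose bags-of-words are similar."""
    bags = [Counter(w for q in task for w in q.lower().split()) for task in tasks]
    groups = connected_groups(bags, bag_similarity, threshold)
    return [[tasks[i] for i in group] for group in groups]


if __name__ == "__main__":
    session = [
        "cheap flights rome",
        "pasta carbonara recipe",
        "flights to rome",
        "carbonara recipe easy",
    ]
    print(user_tasks(session))
```

In this toy session the flight queries and the recipe queries are interleaved, which is the noncontiguous case the abstract describes. A single-link grouping like this one can chain loosely related queries together, so the thresholds would need tuning on labeled data.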

Information & Contributors

Information

Published In

ACM Transactions on Information Systems, Volume 31, Issue 3
July 2013
202 pages
ISSN:1046-8188
EISSN:1558-2868
DOI:10.1145/2493175
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 05 August 2013
Accepted: 01 March 2013
Revised: 01 June 2012
Received: 01 May 2011
Published in TOIS Volume 31, Issue 3

Author Tags

  1. Query log analysis
  2. collective task discovery
  3. collective tasks
  4. query clustering
  5. user search intent
  6. user search session boundaries
  7. user task discovery
  8. user tasks

Qualifiers

  • Research-article
  • Research
  • Refereed
