DOI: 10.1145/3394486.3403291
research-article
Open access

Learning to Cluster Documents into Workspaces Using Large Scale Activity Logs

Published: 20 August 2020

Abstract

Google Drive is widely used for managing personal and work-related documents in the cloud. To help users organize their documents in Google Drive, we develop a new feature, called a workspace, that lets users create a set of working files for easy ongoing access. A workspace is a cluster of documents, but unlike a typical document cluster, it contains documents that are not only topically coherent but also useful in the user's ongoing tasks.
To alleviate the burden of creating workspaces manually, we automatically cluster documents into suggested workspaces. We go beyond the textual similarity-based unsupervised clustering paradigm and instead directly learn from users' activity for document clustering. More specifically, we extract co-access signals (i.e., whether a user accessed two documents around the same time) to measure document relatedness. We then use a neural document similarity model that incorporates text, metadata, as well as co-access features. Since human labels are often difficult or expensive to collect, we extract weak labels based on co-access data at large scale for model training. Our offline and online experiments based on Google Drive show that (a) co-access features are very effective for document clustering; (b) our weakly supervised clustering achieves comparable or even better performance compared to the models trained with human labels; and (c) the weakly supervised method leads to better workspace suggestions that the users accept more often in the production system than baseline approaches.
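The co-access signal and the weak labels described above can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration, not the paper's definition: the log format `(user_id, doc_id, timestamp)`, the one-hour window standing in for "around the same time", the `min_count` threshold, and the function names `co_access_counts` and `weak_labels` are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def co_access_counts(events, window_seconds=3600):
    """Count, per unordered document pair, how often the same user accessed
    both documents within `window_seconds` of each other.

    `events` is a hypothetical activity log of (user_id, doc_id, unix_ts) tuples.
    """
    by_user = {}
    for user, doc, ts in events:
        by_user.setdefault(user, []).append((ts, doc))

    counts = Counter()
    for accesses in by_user.values():
        accesses.sort()  # chronological order per user
        for i, (ts_i, doc_i) in enumerate(accesses):
            for ts_j, doc_j in accesses[i + 1:]:
                if ts_j - ts_i > window_seconds:
                    break  # later accesses are even further away in time
                if doc_i != doc_j:
                    counts[tuple(sorted((doc_i, doc_j)))] += 1
    return counts


def weak_labels(counts, docs, min_count=2):
    """Assumed labeling rule: a pair is positive (1) if it was co-accessed
    at least `min_count` times, otherwise negative (0)."""
    return {
        pair: int(counts.get(pair, 0) >= min_count)
        for pair in combinations(sorted(docs), 2)
    }


# Toy log: two users each open "a" and "b" close together, and "c" much later.
events = [
    ("u1", "a", 0), ("u1", "b", 600), ("u1", "c", 10000),
    ("u2", "a", 50), ("u2", "b", 100), ("u2", "c", 7000),
]
counts = co_access_counts(events)
labels = weak_labels(counts, {"a", "b", "c"})
print(counts)
print(labels)
```

In this toy log only the pair ("a", "b") is co-accessed, once by each user, so it is the only pair labeled positive. Labels of this kind could then serve as training targets for a pairwise document similarity model in place of human annotations.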

Published In

KDD '20: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining
August 2020
3664 pages
ISBN:9781450379984
DOI:10.1145/3394486
This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. activity logs
  2. co-access
  3. document clustering
  4. weak-supervised clustering
  5. workspaces

Acceptance Rates

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%
