research-article

Open access

Learning to Cluster Documents into Workspaces Using Large Scale Activity Logs

Authors:

Michael Bendersky,

Mike Colagrosso

KDD '20: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining

Pages 2416 - 2424

https://doi.org/10.1145/3394486.3403291

Published: 20 August 2020

Abstract

Google Drive is widely used for managing personal and work-related documents in the cloud. To help users organize their documents in Google Drive, we develop a new feature, called a workspace, that lets users create a set of working files for ongoing easy access. A workspace is a cluster of documents, but unlike a typical document cluster, it contains documents that are not only topically coherent but also useful for the user's ongoing tasks.

To alleviate the burden of creating workspaces manually, we automatically cluster documents into suggested workspaces. We go beyond the unsupervised clustering paradigm based on textual similarity and instead learn document clustering directly from users' activity. More specifically, we extract co-access signals (i.e., whether a user accessed two documents around the same time) to measure document relatedness. We then use a neural document similarity model that incorporates text, metadata, and co-access features. Since human labels are often difficult or expensive to collect, we extract weak labels from co-access data at large scale for model training. Our offline and online experiments based on Google Drive show that (a) co-access features are very effective for document clustering; (b) our weakly supervised clustering performs comparably to, or even better than, models trained with human labels; and (c) the weakly supervised method leads to better workspace suggestions, which users accept more often in the production system than those of baseline approaches.
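The co-access signal described in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the log format `(user_id, doc_id, timestamp_seconds)`, the one-hour window, the `min_users` threshold, and the function names `co_access_counts` and `weak_labels` are all assumptions made for illustration. The sketch counts, for each document pair, how many distinct users opened both documents within the window, and then turns those counts into binary weak labels for candidate pairs.

```python
from collections import defaultdict


def co_access_counts(events, window_seconds=3600):
    """Count distinct users who accessed each document pair within a time window.

    `events` is an iterable of (user_id, doc_id, timestamp_seconds) tuples.
    The log format and the window size are illustrative assumptions.
    """
    by_user = defaultdict(list)
    for user, doc, ts in events:
        by_user[user].append((ts, doc))

    counts = defaultdict(int)
    for accesses in by_user.values():
        accesses.sort()  # order each user's accesses by time
        pairs_for_user = set()
        for i, (ts_i, doc_i) in enumerate(accesses):
            for ts_j, doc_j in accesses[i + 1:]:
                if ts_j - ts_i > window_seconds:
                    break  # later accesses are even further away in time
                if doc_i != doc_j:
                    pairs_for_user.add(tuple(sorted((doc_i, doc_j))))
        # Each user contributes at most once per pair.
        for pair in pairs_for_user:
            counts[pair] += 1
    return dict(counts)


def weak_labels(counts, candidate_pairs, min_users=1):
    """Label a pair 1 if at least `min_users` users co-accessed it, else 0.

    The threshold is a hypothetical parameter, not one taken from the paper.
    """
    labels = {}
    for a, b in candidate_pairs:
        key = tuple(sorted((a, b)))
        labels[key] = int(counts.get(key, 0) >= min_users)
    return labels


# Toy activity log: two users open "a" and "b" close together;
# "c" is opened by one user much later.
events = [
    ("u1", "a", 0), ("u1", "b", 100), ("u1", "c", 10000),
    ("u2", "a", 0), ("u2", "b", 50),
]
counts = co_access_counts(events, window_seconds=3600)
labels = weak_labels(counts, [("a", "b"), ("b", "c")])
```

In this toy log, the pair `("a", "b")` receives a positive weak label and `("b", "c")` a negative one. Labels of this kind could then serve as training targets for a pairwise similarity model in place of human annotations, which is the general idea the abstract outlines.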

Index Terms

Learning to Cluster Documents into Workspaces Using Large Scale Activity Logs
1. Information systems
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Machine learning theory
      1. Unsupervised learning and clustering

Published In

KDD '20: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining

August 2020

3664 pages

ISBN:9781450379984

DOI:10.1145/3394486

General Chairs: Rajesh Gupta (UC San Diego, USA), Yan Liu (USC, USA)

Program Chairs: Mohak Shah (LG Electronics, USA), Suju Rajan (LinkedIn, USA)

Publications Chairs: Jiliang Tang (Michigan State, USA), B. Aditya Prakash (Georgia Tech, USA)

Copyright © 2020 Owner/Author.

This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

KDD '20

KDD '20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

July 6 - 10, 2020

CA, Virtual Event, USA

Acceptance Rates

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%
