DOI: 10.1145/2983323.2983708
research-article

Effective and Efficient Spectral Clustering on Text and Link Data

Published: 24 October 2016

Abstract

Clustering text and link data, an important task in text and link analysis, aims to find communities of linked documents by leveraging information from both domains. Because it outperforms its single-domain counterpart, it has attracted increasing attention from practitioners in recent years. Despite this popularity, all existing algorithms for clustering text and link data overlook domain-specific distinctions, which results in unsatisfactory clustering quality. In this paper, we address this limitation by explicitly modeling the domain-specific distinctions in the clustering process. Specifically, we extend the idea of consensus and domain-specific subspace decomposition from flat data to graph data. Such modeling, when coupled with a regularization that further sharpens the information distinction, makes the consensus information between text and link more accurate for clustering with both domains. The final model is cast as a spectral clustering model by imposing subspace orthogonality. To avoid the costly eigen-decomposition required for spectral clustering and to further speed up the optimization, we exploit the data sparsity and the low dimensionality of the subspaces, and deploy a constraint-preserving gradient method to solve the model efficiently. An experimental study on three real datasets shows that our algorithm consistently and significantly outperforms the relevant state-of-the-art algorithms in terms of both quality and efficiency.
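As a rough illustration of what a constraint-preserving gradient method can look like, the sketch below minimizes trace(U^T L U) subject to U^T U = I using a Cayley-transform update in the style of Wen and Yin's feasible method for orthogonality constraints. It is not the model proposed in the paper: the single-graph objective, the dense matrices, the backtracking rule, the toy graph, and the names `normalized_laplacian`, `cayley_step`, and `spectral_embedding` are all assumptions made for this example.

```python
import numpy as np


def normalized_laplacian(A):
    """Symmetric normalized Laplacian I - D^{-1/2} A D^{-1/2}.

    Assumes A is a symmetric adjacency matrix with no isolated nodes.
    """
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    return np.eye(A.shape[0]) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]


def cayley_step(U, G, tau):
    """One update that keeps U^T U = I exactly (up to rounding).

    With W = G U^T - U G^T (skew-symmetric), the Cayley transform
    (I + tau/2 W)^{-1} (I - tau/2 W) U is evaluated in low-rank form,
    so only a 2k x 2k linear system is solved (k = U.shape[1]).
    """
    k = U.shape[1]
    left = np.hstack([G, U])      # n x 2k
    right = np.hstack([U, -G])    # n x 2k, so W = left @ right.T
    M = np.eye(2 * k) + 0.5 * tau * (right.T @ left)
    return U - tau * left @ np.linalg.solve(M, right.T @ U)


def spectral_embedding(L, k, iters=200, tau=0.5, seed=0):
    """Minimize trace(U^T L U) over orthonormal U without eigen-decomposition."""
    rng = np.random.default_rng(seed)
    U, _ = np.linalg.qr(rng.standard_normal((L.shape[0], k)))
    f = np.trace(U.T @ L @ U)
    for _ in range(iters):
        G = 2.0 * (L @ U)          # Euclidean gradient of the objective
        step = tau
        while step > 1e-8:         # simple backtracking (an assumption)
            U_new = cayley_step(U, G, step)
            f_new = np.trace(U_new.T @ L @ U_new)
            if f_new <= f:
                U, f = U_new, f_new
                break
            step *= 0.5
        else:
            break                  # no acceptable step found; stop early
    return U, f


# Toy graph (hypothetical data): two triangles joined by a single edge.
A = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0
L = normalized_laplacian(A)
U, f = spectral_embedding(L, k=2)
```

As in standard spectral clustering, the rows of `U` could then be passed to k-means to obtain cluster labels. The update never leaves the feasible set, and its cost is dominated by products with `L` and a small 2k x 2k solve. This is the general reason such methods benefit from sparse data and low-dimensional subspaces.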

Published In

CIKM '16: Proceedings of the 25th ACM International on Conference on Information and Knowledge Management
October 2016
2566 pages
ISBN:9781450340731
DOI:10.1145/2983323
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. efficiency
  2. spectral clustering
  3. text and link analysis

Qualifiers

  • Research-article

Funding Sources

  • Ministry of Education of Singapore
  • Nanyang Technological University

Conference

CIKM'16: ACM Conference on Information and Knowledge Management
October 24 - 28, 2016
Indianapolis, Indiana, USA

Acceptance Rates

CIKM '16 Paper Acceptance Rate 160 of 701 submissions, 23%;
Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

Article Metrics

  • Downloads (Last 12 months): 9
  • Downloads (Last 6 weeks): 0
Reflects downloads up to 08 Feb 2025

Cited By

  • (2025) Explainable Graph Spectral Clustering of text documents. PLOS ONE 20:2 (e0313238). DOI: 10.1371/journal.pone.0313238. Online publication date: 4-Feb-2025
  • (2024) Prior Indicator Guided Anchor Learning for Multi-View Subspace Clustering. IEEE Transactions on Consumer Electronics 70:1 (144-154). DOI: 10.1109/TCE.2023.3319018. Online publication date: Feb-2024
  • (2024) Evaluation of Chabot Text Classification Using Machine Learning. Conversational Artificial Intelligence (199-218). DOI: 10.1002/9781394200801.ch13. Online publication date: 27-Jan-2024
  • (2022) Robust semi-supervised clustering via data transductive warping. Applied Intelligence 53:2 (1254-1270). DOI: 10.1007/s10489-022-03493-5. Online publication date: 27-Apr-2022
  • (2022) Postimpact similarity: a similarity measure for effective grouping of unlabelled text using spectral clustering. Knowledge and Information Systems. DOI: 10.1007/s10115-022-01658-9. Online publication date: 17-Feb-2022
  • (2021) Joint sparsity-biased variational graph autoencoders. The Journal of Defense Modeling and Simulation: Applications, Methodology, Technology 18:3 (239-246). DOI: 10.1177/1548512921996828. Online publication date: 9-Mar-2021
  • (2021) A constrained optimization approach for cross-domain emotion distribution learning. Knowledge-Based Systems 227:C. DOI: 10.1016/j.knosys.2021.107160. Online publication date: 5-Sep-2021
  • (2020) Characterizing communities of hashtag usage on twitter during the 2020 COVID-19 pandemic by multi-view clustering. Applied Network Science 5:1. DOI: 10.1007/s41109-020-00317-8. Online publication date: 16-Sep-2020
  • (2018) Convergence analysis of gradient descent for eigenvector computation. Proceedings of the 27th International Joint Conference on Artificial Intelligence (2933-2939). DOI: 10.5555/3304889.3305068. Online publication date: 13-Jul-2018
  • (2017) A Multi-View Clustering Method for Community Discovery Integrating Links and Tags. 2017 IEEE 14th International Conference on e-Business Engineering (ICEBE) (23-30). DOI: 10.1109/ICEBE.2017.14. Online publication date: Nov-2017
