Research article
DOI: 10.1145/2487575.2487693

Scalable text and link analysis with mixed-topic link models

Published: 11 August 2013

Abstract

Many data sets contain rich information about objects, as well as pairwise relations between them. For instance, in networks of websites, scientific papers, and other documents, each node has content consisting of a collection of words, as well as hyperlinks or citations to other nodes. In order to perform inference on such data sets, and make predictions and recommendations, it is useful to have models that capture the processes that generate the text at each node and the links between them. In this paper, we combine classic ideas in topic modeling with a variant of the mixed-membership block model recently developed in the statistical physics community. The resulting model has the advantage that its parameters, including the mixture of topics of each document and the resulting overlapping communities, can be inferred with a simple and scalable expectation-maximization algorithm. We test our model on three data sets, performing unsupervised topic classification and link prediction. For both tasks, our model outperforms several existing state-of-the-art methods, achieving higher accuracy with significantly less computation: it analyzes a data set with 1.3 million words and 44 thousand links in a few minutes.
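
The abstract does not give the update equations, so the following is only a rough illustration of the kind of expectation-maximization iteration it refers to. It fits the link side alone, using the Poisson mixed-membership model of Ball, Karrer, and Newman (Phys. Rev. E 84:036103, 2011) as a stand-in; it is not the paper's combined text-and-link model. The function name, its parameters, and the toy graph are assumptions made for this sketch.

```python
import math
import random


def fit_mixed_membership(edges, n_nodes, n_groups, n_iters=200, seed=0):
    """EM for a Poisson mixed-membership link model (illustrative sketch).

    Assumed model: the number of links between nodes i and j is Poisson
    with mean sum_z theta[i][z] * theta[j][z]. `edges` lists each
    undirected edge once as a pair (i, j).
    """
    rng = random.Random(seed)
    # Random positive initialization of the membership weights.
    theta = [[rng.random() + 0.1 for _ in range(n_groups)]
             for _ in range(n_nodes)]
    for _ in range(n_iters):
        # acc[i][z]: expected number of group-z link ends at node i.
        acc = [[0.0] * n_groups for _ in range(n_nodes)]
        for i, j in edges:
            # E-step: share of edge (i, j) attributed to each group z.
            w = [theta[i][z] * theta[j][z] for z in range(n_groups)]
            total = sum(w)
            if total == 0.0:
                continue  # guard against numerical underflow
            for z in range(n_groups):
                q = w[z] / total
                acc[i][z] += q
                acc[j][z] += q
        # M-step: theta[i][z] = acc[i][z] / sqrt(sum over nodes of acc[.][z]).
        for z in range(n_groups):
            norm = math.sqrt(sum(acc[i][z] for i in range(n_nodes)))
            for i in range(n_nodes):
                theta[i][z] = acc[i][z] / norm if norm > 0 else 0.0
    return theta


if __name__ == "__main__":
    # Toy graph (hypothetical): two triangles joined by one bridge edge.
    toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
    for row in fit_mixed_membership(toy_edges, n_nodes=6, n_groups=2):
        print([round(t, 3) for t in row])
```

Each pass touches every edge once per group, so the cost per iteration grows linearly with the number of links, which is consistent with the scalability the abstract emphasizes. Normalizing a row of `theta` gives a node's mixture over groups.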




    Published In

    KDD '13: Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining
    August 2013
    1534 pages
    ISBN:9781450321747
    DOI:10.1145/2487575
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. document classification
    2. link prediction
    3. stochastic block model
    4. topic modeling

    Qualifiers

    • Research-article

    Conference

    KDD '13

    Acceptance Rates

    KDD '13 Paper Acceptance Rate 125 of 726 submissions, 17%;
    Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

    Cited By

    • (2024) Community Detection on Social Networks With Sentimental Interaction. International Journal on Semantic Web & Information Systems, 20:1 (1-23). DOI: 10.4018/IJSWIS.341232. Online publication date: 9-Apr-2024
    • (2023) The concept of decentralization through time and disciplines: a quantitative exploration. EPJ Data Science, 12:1. DOI: 10.1140/epjds/s13688-023-00418-1. Online publication date: 3-Oct-2023
    • (2022) Community Detection in Social Networks Considering Social Behaviors. IEEE Access, 10 (109969-109982). DOI: 10.1109/ACCESS.2022.3209704. Online publication date: 2022
    • (2022) Node Metadata Can Produce Predictability Crossovers in Network Inference Problems. Physical Review X, 12:1. DOI: 10.1103/PhysRevX.12.011010. Online publication date: 14-Jan-2022
    • (2021) Multilayer networks for text analysis with multiple data types. EPJ Data Science, 10:1. DOI: 10.1140/epjds/s13688-021-00288-5. Online publication date: 28-Jun-2021
    • (2020) Identification of Generalized Semantic Communities in Large Social Networks. IEEE Transactions on Network Science and Engineering, 7:4 (2966-2979). DOI: 10.1109/TNSE.2020.3008538. Online publication date: 1-Oct-2020
    • (2020) Recommender systems based on detection community in academic social network. 2020 International Multi-Conference on: “Organization of Knowledge and Advanced Technologies” (OCTA) (1-7). DOI: 10.1109/OCTA49274.2020.9151729. Online publication date: Feb-2020
    • (2020) Overlapping Community Detection in Weighted Temporal Text Networks. IEEE Access, 8 (58118-58129). DOI: 10.1109/ACCESS.2020.2981487. Online publication date: 2020
    • (2020) Improving topic modeling through homophily for legal documents. Applied Network Science, 5:1. DOI: 10.1007/s41109-020-00321-y. Online publication date: 17-Oct-2020
    • (2020) Probabilistic reasoning system for social influence analysis in online social networks. Social Network Analysis and Mining, 11:1. DOI: 10.1007/s13278-020-00705-z. Online publication date: 19-Nov-2020
