research-article

PLDA+: Parallel latent Dirichlet allocation with data placement and pipeline processing

Published: 06 May 2011

Abstract

Previous methods of distributed Gibbs sampling for latent Dirichlet allocation (LDA) run into either memory or communication bottlenecks. To improve scalability, we propose four strategies: data placement, pipeline processing, word bundling, and priority-based scheduling. Experiments show that our strategies significantly reduce the unparallelizable communication bottleneck and achieve good load balancing, and hence improve the scalability of LDA.

Published In

ACM Transactions on Intelligent Systems and Technology, Volume 2, Issue 3
April 2011
259 pages
ISSN: 2157-6904
EISSN: 2157-6912
DOI: 10.1145/1961189
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 06 May 2011
Accepted: 01 October 2010
Revised: 01 June 2010
Received: 01 April 2010
Published in TIST Volume 2, Issue 3

Author Tags

  1. Gibbs sampling
  2. Topic models
  3. distributed parallel computations
  4. latent Dirichlet allocation

Qualifiers

  • Research-article
  • Research
  • Refereed

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 12
  • Downloads (Last 6 weeks): 1
Reflects downloads up to 09 Nov 2024

Citations

Cited By

  • (2024) TopicRefiner: Coherence-Guided Steerable LDA for Visual Topic Enhancement. IEEE Transactions on Visualization and Computer Graphics 30:8 (4542-4557). DOI: 10.1109/TVCG.2023.3266890. Online publication date: 1-Aug-2024
  • (2024) A survey on neural topic models: methods, applications, and challenges. Artificial Intelligence Review 57:2. DOI: 10.1007/s10462-023-10661-7. Online publication date: 25-Jan-2024
  • (2022) PGeoTopic: A Distributed Solution for Mining Geographical Topic Models. IEEE Transactions on Knowledge and Data Engineering 34:2 (881-893). DOI: 10.1109/TKDE.2020.2989142. Online publication date: 1-Feb-2022
  • (2022) Proactive Query Expansion for Streaming Data Using External Sources. 2022 IEEE International Conference on Big Data (Big Data) (701-708). DOI: 10.1109/BigData55660.2022.10020577. Online publication date: 17-Dec-2022
  • (2021) Collaborative Filtering Recommendation Using Nonnegative Matrix Factorization in GPU-Accelerated Spark Platform. Scientific Programming 2021. DOI: 10.1155/2021/8841133. Online publication date: 1-Jan-2021
  • (2021) Topic Modeling Using Latent Dirichlet allocation. ACM Computing Surveys 54:7 (1-35). DOI: 10.1145/3462478. Online publication date: 17-Sep-2021
  • (2021) Effective Implementations of Topic Modeling Algorithms. Programming and Computer Software 47:7 (483-492). DOI: 10.1134/S0361768821070021. Online publication date: 3-Dec-2021
  • (2021) Sys-TM: A Fast and General Topic Modeling System. IEEE Transactions on Knowledge and Data Engineering 33:6 (2790-2802). DOI: 10.1109/TKDE.2019.2956518. Online publication date: 1-Jun-2021
  • (2021) NewsLink: Empowering Intuitive News Search with Knowledge Graphs. 2021 IEEE 37th International Conference on Data Engineering (ICDE) (876-887). DOI: 10.1109/ICDE51399.2021.00081. Online publication date: Apr-2021
  • (2021) Familia: A Configurable Topic Modeling Framework for Industrial Text Engineering. Database Systems for Advanced Applications (516-528). DOI: 10.1007/978-3-030-73200-4_36. Online publication date: 11-Apr-2021
