An architecture for parallel topic models

Editors: Elisa Bertino, Paolo Atzeni, Kian Lee Tan, Yi Chen, Y. C. Tay

Authors: Alexander Smola, Shravan Narayanamurthy

Proceedings of the VLDB Endowment, Volume 3, Issue 1-2

Pages 703 - 710

https://doi.org/10.14778/1920841.1920931

Published: 01 September 2010

Abstract

This paper describes a high-performance sampling architecture for inference in latent topic models on a cluster of workstations. Our system is more than an order of magnitude faster than previous work and can handle hundreds of millions of documents and thousands of topics.

The algorithm relies on a novel communication structure: a distributed (key, value) store that synchronizes the sampler state between computers. This architecture entirely obviates the need for separate computation and synchronization phases; instead, disk, CPU, and network are used simultaneously to achieve high performance. We show that the architecture is entirely general and that it can be extended easily to more sophisticated latent variable models such as n-grams and hierarchies.
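To make the communication pattern concrete, the following is a minimal sketch, not the authors' implementation. It assumes an in-process `KVStore` class standing in for the distributed (key, value) store, and hypothetical names `Worker`, `record`, `sync_once`, and `start_sync_thread`. Each worker keeps a local view of the (word, topic) counts plus a buffer of unsent changes, and a background thread exchanges those changes with the store while sampling continues, so there is no separate synchronization phase. The sampler itself (the Gibbs update that decides a new topic) is omitted.

```python
import threading
from collections import defaultdict


class KVStore:
    """In-process stand-in for a distributed (key, value) store (assumption)."""

    def __init__(self):
        self._data = defaultdict(int)
        self._lock = threading.Lock()

    def add(self, key, delta):
        # Atomically add delta to the stored value and return the new value.
        with self._lock:
            self._data[key] += delta
            return self._data[key]

    def get(self, key):
        with self._lock:
            return self._data.get(key, 0)


class Worker:
    """Holds one machine's view of the shared (word, topic) counts."""

    def __init__(self, store):
        self.store = store
        self.local = defaultdict(int)    # local view of the global counts
        self.pending = defaultdict(int)  # changes not yet sent to the store
        self.lock = threading.Lock()

    def record(self, word, old_topic, new_topic):
        # Called by the sampler whenever a token's topic assignment changes.
        with self.lock:
            for topic, delta in ((old_topic, -1), (new_topic, 1)):
                if topic is not None:
                    self.local[(word, topic)] += delta
                    self.pending[(word, topic)] += delta

    def sync_once(self):
        # Take the buffered changes, then push them without blocking the sampler.
        with self.lock:
            deltas, self.pending = self.pending, defaultdict(int)
        for key, delta in deltas.items():
            fresh = self.store.add(key, delta)
            with self.lock:
                # Global value plus changes recorded locally since the snapshot.
                self.local[key] = fresh + self.pending.get(key, 0)


def start_sync_thread(worker, stop, interval=0.01):
    # Runs synchronization concurrently with sampling until stop is set.
    def loop():
        while not stop.is_set():
            worker.sync_once()
            stop.wait(interval)
        worker.sync_once()  # final flush of any remaining changes

    thread = threading.Thread(target=loop, daemon=True)
    thread.start()
    return thread
```

In this sketch a worker only refreshes keys it has changed itself; a fuller version would also pull counts for keys touched only by other workers. The store sees deltas rather than absolute values, which is what lets several workers update the same count without overwriting each other.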

Published In

Proceedings of the VLDB Endowment Volume 3, Issue 1-2

September 2010

1658 pages

ISSN:2150-8097

Publisher

VLDB Endowment
