DOI: 10.1145/2124295.2124312

Scalable inference in latent variable models

Published: 08 February 2012

Abstract

Latent variable techniques are pivotal in tasks ranging from predicting user click patterns and targeting ads to organizing the news and managing user-generated content. Techniques such as topic modeling, clustering, and subspace estimation provide substantial insight into the latent structure of complex data with little or no external guidance, making them ideal for reasoning about large-scale, rapidly evolving datasets. Unfortunately, because of the data dependencies and global state introduced by latent variables, and the iterative nature of latent variable inference, these techniques are often prohibitively expensive to apply to large-scale, streaming datasets.

In this paper we present a scalable parallel framework for efficient inference in latent variable models over streaming web-scale data. Our framework addresses three key challenges: 1) synchronizing the global state, which includes global latent variables (e.g., cluster centers and dictionaries); 2) efficiently storing and retrieving the large local state, which includes the data points and their corresponding latent variables (e.g., cluster membership); and 3) sequentially incorporating streaming data (e.g., the news). We address these challenges by introducing: 1) a novel delta-based aggregation system with a bandwidth-efficient communication protocol; 2) schedule-aware out-of-core storage; and 3) approximate forward sampling to rapidly incorporate new data. We demonstrate the performance of our framework by tackling datasets two orders of magnitude larger than those addressed by the current state of the art. Furthermore, we provide an optimized and easily customizable open-source implementation of the framework.
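
The idea of delta-based aggregation named in the abstract can be illustrated with a small sketch. This is not the paper's implementation or protocol: the class names `GlobalState` and `Worker`, the method names, and the use of plain count tables are all assumptions made for illustration. The sketch only shows the general principle that each worker records its changes to shared counts locally and ships just the nonzero changes, instead of the full table.

```python
from collections import Counter


class GlobalState:
    """Hypothetical store for global counts (e.g., per-cluster totals)."""

    def __init__(self):
        self.counts = Counter()

    def apply_delta(self, delta):
        # Addition is commutative, so deltas from different workers
        # can be merged in whatever order they arrive.
        for key, change in delta.items():
            self.counts[key] += change


class Worker:
    """Hypothetical worker holding a local copy plus unsent changes."""

    def __init__(self, snapshot):
        self.local = Counter(snapshot)  # working copy used during sampling
        self.delta = Counter()          # changes made since the last flush

    def update(self, key, change):
        # Apply the change locally and remember it for the next flush.
        self.local[key] += change
        self.delta[key] += change

    def flush(self):
        # Ship only entries that actually changed; changes that
        # cancelled out (net zero) cost no bandwidth.
        out = {k: v for k, v in self.delta.items() if v != 0}
        self.delta.clear()
        return out


def demo():
    store = GlobalState()
    store.counts["topic0"] = 10
    a = Worker(store.counts)
    b = Worker(store.counts)
    a.update("topic0", +1)
    a.update("topic1", +2)
    a.update("topic0", -1)   # cancels the earlier +1
    b.update("topic0", +3)
    delta_a = a.flush()
    delta_b = b.flush()
    store.apply_delta(delta_b)  # arrival order does not matter
    store.apply_delta(delta_a)
    return dict(store.counts), delta_a, delta_b
```

In this toy run, worker `a` sends one entry and worker `b` sends one entry, however large the full count table is. A real system would also have to push the merged global state back to workers and handle failures, which this sketch leaves out.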

Published In

WSDM '12: Proceedings of the fifth ACM international conference on Web search and data mining
February 2012
792 pages
ISBN: 9781450307475
DOI: 10.1145/2124295

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. graphical models
  2. inference
  3. large-scale systems
  4. latent models

Qualifiers

  • Research-article

Conference

Acceptance Rates

Overall Acceptance Rate 498 of 2,863 submissions, 17%
