Distributed Latent Dirichlet Allocation on Streams

Published: 20 July 2021

Abstract

Latent Dirichlet Allocation (LDA) has been widely used for topic modeling, with applications spanning areas such as natural language processing and information retrieval. While LDA on small, static datasets has been studied extensively, practical scenarios pose several real-world challenges: datasets are often huge and arrive in a streaming fashion. As the state-of-the-art LDA algorithm on streams, Streaming Variational Bayes (SVB) introduced Bayesian updating to provide a streaming procedure. However, the utility of SVB is limited in applications because it ignores three challenges of processing real-world streams: topic evolution, data turbulence, and real-time inference. In this article, we propose a novel distributed LDA algorithm, referred to as StreamFed-LDA, to deal with these challenges. For topic modeling of streaming data, the ability to capture evolving topics is essential for practical online inference. To achieve this goal, StreamFed-LDA builds on a specialized framework that supports lifelong (continual) learning of evolving topics. Data turbulence, in turn, is common in streams due to real-life events; the design of StreamFed-LDA therefore allows the model to learn new characteristics from the most recent data while retaining historical information. On massive streaming data, providing real-time inference results is as crucial as it is difficult. To increase throughput and reduce latency, StreamFed-LDA introduces additional techniques that substantially reduce both computation and communication costs in distributed systems. Experiments on four real-world datasets show that the proposed framework achieves significantly better online inference performance than the baselines, while reducing latency by orders of magnitude.
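For context, the sketch below illustrates the SVB-style Bayesian updating that the abstract contrasts against: local variational inference is run on each minibatch, and the minibatch posterior over topic-word distributions then becomes the prior for the next minibatch. This is a minimal illustration of SVB for LDA under stated assumptions, not the authors' StreamFed-LDA; the function name svb_update, the symmetric document-topic prior alpha, and the fixed number of local iterations are illustrative choices.

import numpy as np
from scipy.special import digamma

def svb_update(lam, batch, alpha=0.1, n_iter=50):
    """One Streaming Variational Bayes step for LDA.

    lam   : (K, V) global Dirichlet parameters over topic-word distributions.
    batch : (D, V) bag-of-words count matrix for the incoming minibatch.
    Returns updated lam; the minibatch posterior becomes the next prior.
    """
    K, _ = lam.shape
    # E[log beta_kw] under the current global posterior q(beta) = Dirichlet(lam).
    Elog_beta = digamma(lam) - digamma(lam.sum(axis=1, keepdims=True))
    suff = np.zeros_like(lam)                      # expected topic-word counts
    for n_dw in batch:                             # local step, one document at a time
        gamma = np.full(K, alpha + n_dw.sum() / K) # doc-topic Dirichlet parameters
        for _ in range(n_iter):
            Elog_theta = digamma(gamma) - digamma(gamma.sum())
            # phi[k, w] is proportional to exp(E[log theta_k] + E[log beta_kw])
            log_phi = Elog_theta[:, None] + Elog_beta
            log_phi -= log_phi.max(axis=0)         # numerical stabilization
            phi = np.exp(log_phi)
            phi /= phi.sum(axis=0)                 # normalize over topics per word
            gamma = alpha + phi @ n_dw             # update doc-topic posterior
        suff += phi * n_dw                         # accumulate E[n_kw] for this doc
    return lam + suff                              # Bayesian update: posterior -> prior

A stream is then processed as lam = svb_update(lam, batch) over successive minibatches, with lam initialized randomly (e.g., np.random.gamma(100.0, 0.01, size=(K, V))) to break topic symmetry. Because this update only ever accumulates sufficient statistics and never discounts old ones, it cannot forget: this is exactly the limitation, with respect to topic evolution and data turbulence, that the abstract attributes to SVB.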


Cited By

  • (2024) Efficient topic identification for urgent MOOC Forum posts using BERTopic and traditional topic modeling techniques. Education and Information Technologies. DOI: 10.1007/s10639-024-13003-4. Online publication date: 17-Sep-2024.
  • (2023) Different Machine Learning Algorithms used for Secure Software Advance using Software Repositories. International Journal of Scientific Research in Computer Science, Engineering and Information Technology, 300–317. DOI: 10.32628/CSEIT2390225. Online publication date: 5-Apr-2023.



Published In

ACM Transactions on Knowledge Discovery from Data, Volume 16, Issue 1
February 2022
475 pages
ISSN:1556-4681
EISSN:1556-472X
DOI:10.1145/3472794
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 20 July 2021
Accepted: 01 February 2021
Revised: 01 January 2021
Received: 01 July 2020
Published in TKDD Volume 16, Issue 1


Author Tags

  1. Distributed streams
  2. learning system
  3. variational inference

Qualifiers

  • Research-article
  • Refereed

Funding Sources

  • National Natural Science Foundation of China

