DOI: 10.1145/2187980.2188238
tutorial

Large scale microblog mining using distributed MB-LDA

Published: 16 April 2012

Abstract

In the era of information explosion, large-scale data processing and mining is an active research topic. As microblogging grows more popular, microblog services have become web-scale information providers, and microblog research has begun to focus more on content mining than on the analysis of user relationships alone. Although traditional text mining methods have been studied extensively, none is designed specifically for microblog data, which contains structured social network information in addition to plain text. In this paper, we introduce a novel probabilistic generative model, MicroBlog-Latent Dirichlet Allocation (MB-LDA), which takes both the contactor relevance relation and the document relevance relation into account to improve topic mining in microblogs. Using Gibbs sampling for approximate inference, MB-LDA discovers not only the topics of microblogs but also the topics that contactors focus on. On large datasets, traditional single-node techniques become impractical under limited resources, so we present a distributed MB-LDA in the MapReduce framework to process large-scale microblogs with high scalability. Furthermore, we apply a performance model to optimize execution time by tuning the number of mappers and reducers. Experimental results on a real dataset show that MB-LDA outperforms the LDA baseline and that distributed MB-LDA offers an effective solution to topic mining for large-scale microblogs.
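The abstract does not spell out the MB-LDA sampling equations, so the following is only a minimal sketch of the general idea it builds on: collapsed Gibbs sampling for plain LDA (the baseline, not MB-LDA), split into a map step and a reduce step. It assumes a common approximate scheme in which each shard samples against a private copy of the global word-topic counts and the count changes are summed afterwards. All function names, the shard layout, and the hyperparameter values are hypothetical and chosen for illustration; a real MapReduce job would run the map calls on separate workers rather than in a loop.

```python
import random

def gibbs_sweep(docs, z, doc_topic, word_topic, topic_total, K, V, alpha, beta, rng):
    """One collapsed Gibbs sweep for plain LDA over a list of documents."""
    for d, words in enumerate(docs):
        for i, w in enumerate(words):
            k = z[d][i]
            # Remove the token's current assignment from all counts.
            doc_topic[d][k] -= 1
            word_topic[w][k] -= 1
            topic_total[k] -= 1
            # p(z = t | rest) is proportional to
            # (n_dt + alpha) * (n_wt + beta) / (n_t + V * beta).
            weights = [
                (doc_topic[d][t] + alpha) * (word_topic[w][t] + beta)
                / (topic_total[t] + V * beta)
                for t in range(K)
            ]
            k = rng.choices(range(K), weights=weights)[0]
            z[d][i] = k
            doc_topic[d][k] += 1
            word_topic[w][k] += 1
            topic_total[k] += 1

def map_shard(shard, global_wt, K, V, alpha, beta, seed):
    """Map step (assumed scheme): sample one shard against a private copy
    of the global word-topic counts and return the count changes."""
    rng = random.Random(seed)
    local_wt = [row[:] for row in global_wt]
    local_total = [sum(local_wt[w][t] for w in range(V)) for t in range(K)]
    gibbs_sweep(shard["docs"], shard["z"], shard["doc_topic"],
                local_wt, local_total, K, V, alpha, beta, rng)
    return [[local_wt[w][t] - global_wt[w][t] for t in range(K)] for w in range(V)]

def reduce_deltas(global_wt, deltas):
    """Reduce step: add every shard's count changes to the global counts."""
    for delta in deltas:
        for w, row in enumerate(delta):
            for t, change in enumerate(row):
                global_wt[w][t] += change

def init_shards(corpus, n_shards, K, V, rng):
    """Assign random topics and build the initial counts, one shard per mapper."""
    global_wt = [[0] * K for _ in range(V)]
    shards = []
    for s in range(n_shards):
        docs = corpus[s::n_shards]
        z = [[rng.randrange(K) for _ in doc] for doc in docs]
        doc_topic = [[0] * K for _ in docs]
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                doc_topic[d][z[d][i]] += 1
                global_wt[w][z[d][i]] += 1
        shards.append({"docs": docs, "z": z, "doc_topic": doc_topic})
    return shards, global_wt

def train(corpus, K, V, n_shards=2, iters=20, alpha=0.1, beta=0.01, seed=0):
    """Alternate map and reduce steps; documents are lists of word ids in [0, V)."""
    rng = random.Random(seed)
    shards, global_wt = init_shards(corpus, n_shards, K, V, rng)
    for _ in range(iters):
        deltas = [map_shard(sh, global_wt, K, V, alpha, beta, rng.randrange(2**32))
                  for sh in shards]
        reduce_deltas(global_wt, deltas)
    return shards, global_wt
```

Because every shard only moves its own tokens between topics, the merged counts always equal the counts implied by the current topic assignments. The sampling itself is approximate, since shards within one iteration do not see each other's updates. The number of shards here plays the role of the mapper count that the paper tunes with its performance model; that model is not described in the abstract and is not sketched here.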

Published In

WWW '12 Companion: Proceedings of the 21st International Conference on World Wide Web
April 2012
1250 pages
ISBN:9781450312301
DOI:10.1145/2187980
Sponsors

  • Univ. de Lyon: Universite de Lyon

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. distributed MB-LDA
  2. large-scale
  3. mapreduce
  4. microblogs
  5. social network

Qualifiers

  • Tutorial

Conference

WWW 2012
Sponsor:
  • Univ. de Lyon
WWW 2012: 21st World Wide Web Conference 2012
April 16 - 20, 2012
Lyon, France

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

Cited By

  • (2024) Weibo-FA: A Benchmark Dataset for Fake Account Detection in Weibo Platform. Web Information Systems Engineering – WISE 2024, pp. 166-176. DOI: 10.1007/978-981-96-0576-7_13. Online publication date: 27-Nov-2024
  • (2023) Popularity Prediction of Micro-Blog Hot Topics Based on Time-Series Data. 2023 15th International Conference on Software, Knowledge, Information Management and Applications (SKIMA), pp. 199-204. DOI: 10.1109/SKIMA59232.2023.10387341. Online publication date: 8-Dec-2023
  • (2020) Extracting and tracking hot topics of micro-blogs based on improved Latent Dirichlet Allocation. Engineering Applications of Artificial Intelligence, 87 (103279). DOI: 10.1016/j.engappai.2019.103279. Online publication date: Jan-2020
  • (2018) Dynamic topic modeling via self-aggregation for short text streams. Peer-to-Peer Networking and Applications. DOI: 10.1007/s12083-018-0692-7. Online publication date: 14-Nov-2018
  • (2018) Tracking Topic Trends for Short Texts. Knowledge Graph and Semantic Computing. Language, Knowledge, and Intelligence, pp. 117-128. DOI: 10.1007/978-981-10-7359-5_12. Online publication date: 20-Jan-2018
  • (2018) HDP-TUB Based Topic Mining Method for Chinese Micro-blogs. Natural Language Processing and Chinese Computing, pp. 856-865. DOI: 10.1007/978-3-319-73618-1_75. Online publication date: 5-Jan-2018
  • (2017) Revealing Learner Interests through Topic Mining from Question-Answering Data. International Journal of Distance Education Technologies, 15:2, pp. 18-32. DOI: 10.4018/IJDET.2017040102. Online publication date: 1-Apr-2017
  • (2017) Discovery and classification of user interests on social media. Information Discovery and Delivery, 45:3, pp. 130-138. DOI: 10.1108/IDD-03-2017-0023. Online publication date: 21-Aug-2017
  • (2017) Chapter 14 Semantic Search of Online Reviews on E-Business Platforms. Internet+ and Electronic Business in China: Innovation and Applications, pp. 461-482. DOI: 10.1108/978-1-78743-115-720171016. Online publication date: 21-Dec-2017
  • (2016) Personality as a metric for topic models on social networks. Journal of High Speed Networks, 22:2, pp. 169-176. DOI: 10.3233/JHS-160540. Online publication date: 30-Mar-2016
