Abstract
The explosive growth of Web 2.0 applications (e.g., social networks, question-answering forums, and blogs) leads to the continuous generation of short texts. Clustering analysis, which automatically categorizes streaming short texts, has proven to be a critical unsupervised learning technique. However, the unique attributes of short texts (e.g., few meaningful keywords, noisy features, and lack of context) and the temporal dynamics of data in the stream make this task challenging.
To tackle this problem, we propose EWNStream, a stream clustering algorithm that explores the Evolutionary Word relation Network. The word relation network is constructed from word co-occurrence patterns aggregated over batches of short texts in the stream, which overcomes the sparsity of features at the level of individual documents. To cope with the temporal dynamics of the stream, the word relation network is incrementally updated as new batches of data arrive. Changes in the word relation network indicate the evolution of the underlying clusters in the stream. Based on the evolutionary word relation network, we propose a keyword group discovery strategy to extract representative terms for the underlying short text clusters. The keyword groups are then used as cluster centers to group the short texts in the stream. Experimental results on a real-world Twitter dataset show that our method achieves much better clustering accuracy and time efficiency.
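The pipeline described above (build a word co-occurrence network per batch, update it incrementally, extract keyword groups, and use them as cluster centers) can be illustrated with a minimal Python sketch. This is not the EWNStream implementation: the abstract does not specify edge weighting, the update rule, or the keyword group discovery strategy, so the `decay` factor, the `min_weight` threshold, the connected-component grouping, and the overlap-based `assign` function are all assumptions chosen only for illustration.

```python
from collections import defaultdict
from itertools import combinations


class WordRelationNetwork:
    """Toy word co-occurrence graph updated batch by batch (illustrative only)."""

    def __init__(self, decay=0.5):
        self.decay = decay                 # hypothetical fading factor for old edges
        self.weights = defaultdict(float)  # (word_a, word_b) -> edge weight

    def update(self, batch):
        # Assumed update rule: fade existing edges so stale relations lose influence.
        for edge in list(self.weights):
            self.weights[edge] *= self.decay
        # Aggregate co-occurrence counts over every short text in the batch.
        for tokens in batch:
            for a, b in combinations(sorted(set(tokens)), 2):
                self.weights[(a, b)] += 1.0

    def keyword_groups(self, min_weight=2.0):
        # Stand-in strategy: connected components over edges with weight >= min_weight.
        adjacency = defaultdict(set)
        for (a, b), weight in self.weights.items():
            if weight >= min_weight:
                adjacency[a].add(b)
                adjacency[b].add(a)
        groups, seen = [], set()
        for start in sorted(adjacency):
            if start in seen:
                continue
            component, stack = set(), [start]
            while stack:
                word = stack.pop()
                if word in component:
                    continue
                component.add(word)
                stack.extend(adjacency[word] - component)
            seen |= component
            groups.append(component)
        return groups


def assign(tokens, groups):
    """Return the index of the keyword group sharing the most words, or None."""
    overlaps = [len(set(tokens) & group) for group in groups]
    if not overlaps or max(overlaps) == 0:
        return None
    return overlaps.index(max(overlaps))


# Made-up example data: two topics arrive in the first batch, one persists.
net = WordRelationNetwork(decay=0.5)
net.update([
    ["apple", "iphone", "launch"],
    ["apple", "iphone", "price"],
    ["storm", "flood", "coast"],
    ["storm", "flood", "rescue"],
])
groups = net.keyword_groups(min_weight=2.0)
print([sorted(g) for g in groups])
print(assign(["iphone", "review"], groups))

net.update([["storm", "flood", "warning"]])
print([sorted(g) for g in net.keyword_groups(min_weight=2.0)])
```

In this toy run the first batch yields two keyword groups, and after the second batch only the storm-related group stays above the threshold, which mirrors the idea that changes in the network reflect the evolution of the underlying clusters.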
Acknowledgments
This work was partially supported by an Australian Research Council (ARC) grant (No. DE140100387).
Copyright information
© 2020 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Yang, S., Huang, G., Zhou, X., Xiang, Y. (2020). Dynamic Clustering of Stream Short Documents Using Evolutionary Word Relation Network. In: He, J., et al. Data Science. ICDS 2019. Communications in Computer and Information Science, vol 1179. Springer, Singapore. https://doi.org/10.1007/978-981-15-2810-1_40
DOI: https://doi.org/10.1007/978-981-15-2810-1_40
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-2809-5
Online ISBN: 978-981-15-2810-1
eBook Packages: Computer Science, Computer Science (R0)