Abstract
News classification has recently become an essential task in Natural Language Processing (NLP). The traditional Latent Dirichlet Allocation (LDA) model uses the generated “topic-document” matrix θ as the text representation feature to train a classifier and has achieved improved results. However, using only the “topic-document” matrix θ as the text feature discards some of the textual information. In addition, the number of Gibbs sampling iterations in the traditional LDA model must be set in advance, which affects the algorithm’s speed. In this paper, the traditional LDA model is improved in two phases. In the first phase, a method is proposed to determine the convergence of the parameter search process, and an adaptive iteration scheme is built on it. In the second phase, a new text representation (Cnew), obtained by multiplying the “topic-document” matrix θ by the “word-topic” matrix φ, is introduced. In the evaluation, the proposed method is tested on a news corpus from the field of metallurgy and on the THU Chinese News (THUCNews) corpus provided by the Natural Language Processing Laboratory of Tsinghua University. Compared with the traditional LDA, the proposed method improves classification accuracy and reduces the number of Gibbs sampling iterations.
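The core of the proposed representation can be illustrated with a minimal numpy sketch. This is not the authors’ implementation; the shapes and Dirichlet-sampled matrices below are illustrative assumptions. Given a “topic-document” matrix θ of shape (documents × topics) and a “word-topic” matrix φ of shape (topics × vocabulary), their product yields a (documents × vocabulary) feature matrix that blends each document’s topic mixture with each topic’s word distribution:

```python
import numpy as np

# Hypothetical sizes: D documents, K topics, V vocabulary words.
D, K, V = 4, 3, 6
rng = np.random.default_rng(0)

# "topic-document" matrix theta: each row is a document's topic distribution.
theta = rng.dirichlet(np.ones(K), size=D)   # shape (D, K), rows sum to 1
# "word-topic" matrix phi: each row is a topic's word distribution.
phi = rng.dirichlet(np.ones(V), size=K)     # shape (K, V), rows sum to 1

# New representation Cnew = theta @ phi: each document is now described
# by a distribution over the full vocabulary rather than over topics only.
C_new = theta @ phi                          # shape (D, V)

print(C_new.shape)                # (4, 6)
print(C_new.sum(axis=1))          # each row still sums to ~1.0
```

Because each row of θ and each row of φ is a probability distribution, each row of Cnew is also a valid distribution over the vocabulary, so it can be fed directly to a downstream classifier in place of θ.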
Acknowledgements
This work was supported by the Postdoctoral Science Foundation of China (2016M592894XB), the Nature and Science Foundation of China (61741112) and the Nature and Science Foundation of Yunnan Province (2017FB098). We also appreciate the valuable comments from the other members of our department.
Cite this article
Shao, D., Li, C., Huang, C. et al. A news classification applied with new text representation based on the improved LDA. Multimed Tools Appl 81, 21521–21545 (2022). https://doi.org/10.1007/s11042-022-12713-6