Abstract
News classification has recently become an essential task in Natural Language Processing (NLP). The traditional Latent Dirichlet Allocation (LDA) model uses the generated “topic-document” matrix θ as the text representation feature to train a classifier and has achieved improved results. However, using only the “topic-document” matrix θ as the text feature discards some of the textual information. In addition, the number of Gibbs sampling iterations in the traditional LDA model must be set in advance, which affects the algorithm’s speed. In this paper, the traditional LDA model is improved in two phases. In the first phase, a method is proposed to determine the convergence of the parameter search process, and an adaptive iteration scheme is built on it. In the second phase, a new text representation (Cnew), obtained by multiplying the “topic-document” matrix θ by the “word-topic” matrix φ, is introduced. In the evaluation, the proposed method is tested on a news corpus from the field of metallurgy and on the THU Chinese News (THUCNews) corpus provided by the Natural Language Processing Laboratory of Tsinghua University. Compared with the traditional LDA, the proposed method improves classification accuracy and reduces the number of Gibbs sampling iterations.
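The core of the proposed representation can be illustrated with a minimal numpy sketch. This is not the authors’ implementation; the shapes and Dirichlet-sampled matrices below are illustrative assumptions. Given a “topic-document” matrix θ of shape (documents × topics) and a “word-topic” matrix φ of shape (topics × vocabulary), their product yields a (documents × vocabulary) feature matrix that blends each document’s topic mixture with each topic’s word distribution:

```python
import numpy as np

# Hypothetical sizes: D documents, K topics, V vocabulary words.
D, K, V = 4, 3, 6
rng = np.random.default_rng(0)

# "topic-document" matrix theta: each row is a document's topic distribution.
theta = rng.dirichlet(np.ones(K), size=D)   # shape (D, K), rows sum to 1
# "word-topic" matrix phi: each row is a topic's word distribution.
phi = rng.dirichlet(np.ones(V), size=K)     # shape (K, V), rows sum to 1

# New representation Cnew = theta @ phi: each document is now described
# by a distribution over the full vocabulary rather than over topics only.
C_new = theta @ phi                          # shape (D, V)

print(C_new.shape)                # (4, 6)
print(C_new.sum(axis=1))          # each row still sums to ~1.0
```

Because each row of θ and each row of φ is a probability distribution, each row of Cnew is also a valid distribution over the vocabulary, so it can be fed directly to a downstream classifier in place of θ.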
Acknowledgements
This work was supported by the Postdoctoral Science Foundation of China (2016M592894XB), the Nature and Science Foundation of China (61741112) and the Nature and Science Foundation of Yunnan Province (2017FB098). We also appreciate the valuable comments from the other members of our department.
Cite this article
Shao, D., Li, C., Huang, C. et al. A news classification applied with new text representation based on the improved LDA. Multimed Tools Appl 81, 21521–21545 (2022). https://doi.org/10.1007/s11042-022-12713-6