A news classification applied with new text representation based on the improved LDA

D Shao, C Li, C Huang, Y Xiang, Z Yu - Multimedia tools and applications, 2022 - Springer
Abstract
Recently, news classification has become an essential part of Natural Language Processing (NLP). The traditional Latent Dirichlet Allocation (LDA) model uses the generated “topic-document” matrix θ as a text-representation feature to train a classifier and has achieved good results. However, some text information is lost when only the “topic-document” matrix θ is used as the text feature. In addition, the number of Gibbs sampling iterations in the traditional LDA model must be set in advance, which affects the algorithm’s speed. In this paper, the traditional LDA model is improved in two phases. In the first phase, a method to determine the convergence of the parameter search process is proposed, and an adaptive iteration scheme is built on it. In the second phase, a new text representation (Cnew), obtained by multiplying the “topic-document” matrix θ by the “word-topic” matrix φ, is provided. In the evaluation, the proposed method is tested on a news corpus in the field of metallurgy and on the THU Chinese News (THUCNews) corpus provided by the Natural Language Processing Laboratory of Tsinghua University. The results show that the proposed method improves classification accuracy and reduces the number of Gibbs sampling iterations compared with the traditional LDA.
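The core of the second phase, forming the new representation Cnew from the two LDA matrices, can be sketched as below. This is a minimal illustration, not the paper's implementation: the dimensions and the random Dirichlet draws standing in for a trained model's θ and φ are assumptions for demonstration. Multiplying the document-topic distribution θ (D×K) by the topic-word distribution φ (K×V) yields a D×V matrix in which each row is a document's induced distribution over the vocabulary.

```python
import numpy as np

# Hypothetical sizes: D documents, K topics, V vocabulary words.
D, K, V = 4, 3, 6
rng = np.random.default_rng(0)

# Stand-ins for a trained LDA model's parameters:
# theta: "topic-document" matrix, each row a document's topic distribution.
theta = rng.dirichlet(np.ones(K), size=D)   # shape (D, K)
# phi: "word-topic" matrix, each row a topic's word distribution.
phi = rng.dirichlet(np.ones(V), size=K)     # shape (K, V)

# New text representation: Cnew = theta @ phi, shape (D, V).
# Each row mixes the topics' word distributions by the document's
# topic weights, so it remains a valid probability distribution.
C_new = theta @ phi

print(C_new.shape)                           # (4, 6)
print(np.allclose(C_new.sum(axis=1), 1.0))   # True: rows sum to 1
```

Because each row of Cnew is still a distribution over words, it can be fed directly to a classifier in place of (or alongside) the K-dimensional θ rows, carrying word-level information that θ alone discards.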