Scalable term selection for text categorization

J Li, M Sun - Proceedings of the 2007 Joint Conference on …, 2007 - aclanthology.org
Abstract
In text categorization, term selection is an important step for both categorization accuracy and computational efficiency. Different dimensionalities are expected under different practical restrictions on time or space. Traditionally, the same scoring or ranking criterion, such as χ² or IG, is adopted for all target dimensionalities; such a criterion considers both the discriminability and the coverage of a term. In this paper, the poor accuracy at a low dimensionality is attributed to the small average vector length of the documents. Scalable term selection is proposed to optimize the term set at a given dimensionality according to an expected average vector length. Discriminability and coverage are measured separately; by adjusting the ratio of their weights in a combined criterion, the expected average vector length can be reached, which gives a good compromise between the specificity and the exhaustivity of the term subset. Experiments show that accuracy is considerably improved at lower dimensionalities, and that larger term subsets can lower the average vector length for a lower computational cost. These observations might inspire further investigation.
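The idea of weighting coverage against discriminability until a target average vector length is reached can be illustrated with a minimal sketch. The abstract does not give the paper's actual measures or its combination formula, so everything concrete here is an assumption made for illustration: coverage is taken to be document frequency, discriminability is taken to be the largest class share among the documents containing the term, the two are combined log-linearly with a weight `lam`, and `lam` is chosen by a simple grid search. The function names are hypothetical.

```python
import numpy as np


def average_vector_length(X, selected):
    """Mean number of selected terms present per document.

    X is a binary documents-by-terms matrix; selected is an array of term indices.
    """
    return X[:, selected].sum(axis=1).mean()


def select_terms(X, y, k, lam):
    """Return the indices of the k terms with the highest combined score.

    Assumed measures (not taken from the paper):
      coverage         = document frequency of the term
      discriminability = max over classes of P(class | term present)
    lam in [0, 1] is the weight on coverage; 1 - lam is the weight on
    discriminability. The combination is linear in log space.
    """
    df = X.sum(axis=0).astype(float)                  # documents containing each term
    classes = np.unique(y)
    per_class = np.stack([X[y == c].sum(axis=0) for c in classes])
    disc = per_class.max(axis=0) / np.maximum(df, 1.0)
    eps = 1e-12                                       # avoids log(0) for unseen terms
    score = lam * np.log(df + eps) + (1.0 - lam) * np.log(disc + eps)
    return np.argsort(-score, kind="stable")[:k]


def scalable_selection(X, y, k, target_avl, grid=None):
    """Grid-search lam so the selected subset's average vector length
    is as close as possible to target_avl. Returns (term indices, lam)."""
    if grid is None:
        grid = np.linspace(0.0, 1.0, 21)
    best = None
    for lam in grid:
        selected = select_terms(X, y, k, lam)
        gap = abs(average_vector_length(X, selected) - target_avl)
        if best is None or gap < best[0]:
            best = (gap, lam, selected)
    return best[2], best[1]
```

In this sketch, a larger `lam` favors frequent terms and tends to give longer document vectors, while a smaller `lam` favors class-specific terms and tends to give shorter ones. A grid search is used rather than bisection because the average vector length is not guaranteed to change monotonically with `lam`.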