DOI: 10.1007/978-3-642-37256-8_28

Distributional term representations for short-text categorization

Published: 24 March 2013

    Abstract

    Every day, millions of short texts are generated, and effective tools are needed to organize and retrieve them. Because these documents are very short and their representations extremely sparse, directly applying standard text categorization methods is not effective. In this work we propose using distributional term representations (DTRs) for short-text categorization. DTRs represent terms by means of contextual information, given by document occurrence and term co-occurrence statistics. They therefore allow us to build enriched document representations that help to overcome, to some extent, the short-length and high-sparsity issues. We report experimental results on three challenging collections, using a variety of classification methods. These results show that DTRs improve classification performance in short-text categorization.
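
    The idea of representing terms by document occurrence and term co-occurrence statistics can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes raw, unweighted counts, a toy three-document collection, and a document representation built by summing the vectors of the document's terms and L2-normalizing the result. The function names (`dor_vectors`, `tcor_vectors`, `represent`) are hypothetical.

```python
from collections import Counter
from math import sqrt


def build_vocab(docs):
    """Sorted list of all distinct terms in the collection."""
    return sorted({term for doc in docs for term in doc})


def dor_vectors(docs, vocab):
    """Document occurrence: each term becomes a vector over documents.

    Assumption: entries are raw term counts, with no weighting applied.
    """
    counts = [Counter(doc) for doc in docs]
    return {t: [float(c[t]) for c in counts] for t in vocab}


def tcor_vectors(docs, vocab):
    """Term co-occurrence: each term becomes a vector over the vocabulary.

    Assumption: entry (t, u) is the number of documents containing both
    t and u; a term is not counted as co-occurring with itself.
    """
    index = {t: i for i, t in enumerate(vocab)}
    vecs = {t: [0.0] * len(vocab) for t in vocab}
    for doc in docs:
        present = set(doc)
        for t in present:
            for u in present:
                if u != t:
                    vecs[t][index[u]] += 1.0
    return vecs


def represent(doc, term_vecs, dim):
    """Sum the vectors of the document's known terms, then L2-normalize."""
    acc = [0.0] * dim
    for t in doc:
        vec = term_vecs.get(t)
        if vec is None:  # skip terms unseen in the collection
            continue
        for i, x in enumerate(vec):
            acc[i] += x
    norm = sqrt(sum(x * x for x in acc))
    return [x / norm for x in acc] if norm else acc


# Toy collection (made up for illustration only).
docs = [
    ["cheap", "flights", "paris"],
    ["paris", "hotel", "deals"],
    ["python", "code", "bug"],
]
vocab = build_vocab(docs)
dor = dor_vectors(docs, vocab)
tcor = tcor_vectors(docs, vocab)

# A two-word short text gains weight on related terms it does not contain.
query = ["hotel", "flights"]
enriched = represent(query, tcor, len(vocab))
nonzero_terms = [t for t, x in zip(vocab, enriched) if x > 0]
```

    In this toy setting the two-term text ends up with non-zero weight on terms it never mentions, which is the sense in which such representations are less sparse than a plain bag of words.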

    Published In

    CICLing'13: Proceedings of the 14th International Conference on Computational Linguistics and Intelligent Text Processing - Volume 2
    March 2013
    571 pages
    ISBN: 9783642372551
    Editor: Alexander Gelbukh

    Sponsors

    • University of the Aegean
    • IPN: Instituto Politécnico Nacional
    • Natural Language and Text Processing Laboratory, CIC-IPN

    Publisher

    Springer-Verlag, Berlin, Heidelberg
