research-article

A Self-enriching Methodology for Clustering Narrow Domain Short Texts

Published: 01 July 2011

Abstract

Clustering narrow domain short texts is considered a complex task because of two intrinsic features of the corpus to be clustered: (i) the low frequency of vocabulary terms in short texts, and (ii) the high vocabulary overlap associated with narrow domains. The aim of this paper is to introduce a self-term expansion methodology for improving the performance of clustering methods on corpora of this kind. The methodology enriches raw textual data by adding co-related terms from an automatically constructed lexical knowledge resource obtained from the target data set itself, not from an external resource. We also propose a set of supervised and unsupervised text assessment measures for evaluating different corpus features, such as shortness, stylometry and domain broadness. With the help of these measures, we may determine beforehand whether or not to apply the proposed methodology. Finally, we integrate all these assessment measures in a freely available web-based system named Watermarking Corpora On-line System, which computer scientists may use to evaluate the features of a given textual corpus.
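
The self-term expansion idea described in the abstract can be illustrated with a minimal sketch. The abstract does not say how co-related terms are identified, so this sketch assumes document-level co-occurrence scored with pointwise mutual information (PMI). The function names `build_cooccurrence` and `self_expand`, the `min_pmi` threshold, and the toy corpus are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter
from itertools import combinations


def build_cooccurrence(docs, min_pmi=1.0):
    """Build a term -> related-terms map from the corpus itself.

    Assumption: two terms are "co-related" when their document-level
    PMI is at least min_pmi. The paper may use a different criterion.
    """
    n_docs = len(docs)
    term_df = Counter()  # number of documents containing each term
    pair_df = Counter()  # number of documents containing both terms of a pair
    for doc in docs:
        terms = sorted(set(doc))  # sorted so each pair has one canonical order
        term_df.update(terms)
        pair_df.update(combinations(terms, 2))

    related = {}
    for (a, b), n_ab in pair_df.items():
        # PMI = log2( P(a, b) / (P(a) * P(b)) ), with document-level estimates
        pmi = math.log2(n_ab * n_docs / (term_df[a] * term_df[b]))
        if pmi >= min_pmi:
            related.setdefault(a, set()).add(b)
            related.setdefault(b, set()).add(a)
    return related


def self_expand(docs, min_pmi=1.0):
    """Append to each document the co-related terms it does not already contain."""
    related = build_cooccurrence(docs, min_pmi)
    expanded = []
    for doc in docs:
        extra = set()
        for term in doc:
            extra |= related.get(term, set())
        extra -= set(doc)
        expanded.append(list(doc) + sorted(extra))
    return expanded


if __name__ == "__main__":
    # Hypothetical toy corpus of tokenized short texts.
    toy_docs = [
        ["quark", "gluon"],
        ["quark", "gluon", "plasma"],
        ["higgs", "boson"],
        ["higgs", "boson", "decay"],
    ]
    for original, enriched in zip(toy_docs, self_expand(toy_docs)):
        print(original, "->", enriched)
```

Because the co-occurrence map is built from the same documents that are later expanded, no external lexical resource is needed, which is the property the abstract emphasizes. The enriched token lists can then be passed to any clustering method in place of the raw texts.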

    Published In

    The Computer Journal, Volume 54, Issue 7
    July 2011
    227 pages

    Publisher

    Oxford University Press, Inc.

    United States

    Qualifiers

    • Research-article

    Cited By

    • (2024) Classifying the Mexican epidemiological semaphore colour from the Covid-19 text Spanish news. Journal of Information Science, 50:3, pp. 568-589. DOI: 10.1177/01655515221100952. Online publication date: 1-Jun-2024
    • (2018) LDA Meets Word2Vec. Companion Proceedings of the The Web Conference 2018, pp. 1699-1706. DOI: 10.1145/3184558.3191629. Online publication date: 23-Apr-2018
    • (2018) Corpus-based topic diffusion for short text clustering. Neurocomputing, 275:C, pp. 2444-2458. DOI: 10.1016/j.neucom.2017.11.019. Online publication date: 31-Jan-2018
    • (2017) A general framework to expand short text for topic modeling. Information Sciences: an International Journal, 393:C, pp. 66-81. DOI: 10.5555/3062405.3062584. Online publication date: 1-Jul-2017
    • (2016) Improving classification of tweets using word-word co-occurrence information from a large external corpus. Proceedings of the 31st Annual ACM Symposium on Applied Computing, pp. 1174-1177. DOI: 10.1145/2851613.2851986. Online publication date: 4-Apr-2016
    • (2014) Socio-Political Event Extraction Using a Rule-Based Approach. Proceedings of the Confederated International Workshops on On the Move to Meaningful Internet Systems: OTM 2014 Workshops - Volume 8842, pp. 537-546. DOI: 10.1007/978-3-662-45550-0_55. Online publication date: 27-Oct-2014
    • (2013) Analysis of short texts on the Web. Language Resources and Evaluation, 47:1, pp. 123-126. DOI: 10.1007/s10579-013-9220-9. Online publication date: 1-Mar-2013
    • (2013) Distributional term representations for short-text categorization. Proceedings of the 14th international conference on Computational Linguistics and Intelligent Text Processing - Volume 2, pp. 335-346. DOI: 10.1007/978-3-642-37256-8_28. Online publication date: 24-Mar-2013
    • (2011) On the difficulty of clustering microblog texts for online reputation management. Proceedings of the 2nd Workshop on Computational Approaches to Subjectivity and Sentiment Analysis, pp. 146-152. DOI: 10.5555/2107653.2107672. Online publication date: 24-Jun-2011
    • (2011) Instance selection in text classification using the silhouette coefficient measure. Proceedings of the 10th Mexican international conference on Advances in Artificial Intelligence - Volume Part I, pp. 357-369. DOI: 10.1007/978-3-642-25324-9_31. Online publication date: 26-Nov-2011
