DOI: 10.1145/1099554.1099655

A structure-sensitive framework for text categorization

Published: 31 October 2005

Abstract

This paper presents a framework called Structure Sensitive CATegorization (SSCAT) that exploits document structure for improved categorization. The framework has two parts. (1) Documents often have layout structure, in which logically coherent text is grouped into fields using a mark-up language. We use a log-linear model that associates one or more features with each field. The weights associated with the field features are learnt from training data, and they quantify the per-class importance of each field feature in determining the category of the document. (2) We employ a technique that exploits the parse tree of fields that are phrasal constructs, such as the title, and associates weights with the words in these constructs while boosting the weights of important words, called focus words. These weights are learnt from example instances of phrasal constructs marked with the corresponding focus words. The learning is accomplished by training a classifier that uses linguistic features obtained from the text's parse structure. The weighted words in fields with phrasal constructs are then used to obtain features for the corresponding fields in the overall framework. SSCAT was tested on the supervised categorization of over one million products from Yahoo!'s on-line shopping data. With an accuracy of over 90%, our classifier outperforms Naive Bayes and Support Vector Machines. This not only shows the effectiveness of SSCAT but also strengthens our belief that linguistic features based on natural language structure can improve tasks such as text categorization.
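
The abstract does not give the exact feature definitions, so the following is only a minimal sketch of the general idea in part (1): a log-linear (softmax) model with one learned weight per class and field. It assumes, hypothetically, that each field contributes a single numeric score per class (for example, a bag-of-words match score); the names `field_scores`, `weights`, `class_probabilities`, and `sgd_step` are illustrative and do not come from the paper.

```python
import math


def class_probabilities(field_scores, weights):
    """Return P(class | document) under a log-linear (softmax) model.

    field_scores[c][f]: score of field f for class c (hypothetical feature).
    weights[c][f]: learned per-class importance of field f.
    """
    logits = {
        c: sum(weights[c][f] * value for f, value in feats.items())
        for c, feats in field_scores.items()
    }
    top = max(logits.values())  # subtract the max for numerical stability
    exps = {c: math.exp(z - top) for c, z in logits.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}


def sgd_step(field_scores, gold, weights, lr=0.1):
    """One gradient-ascent step on log P(gold | document)."""
    probs = class_probabilities(field_scores, weights)
    for c, feats in field_scores.items():
        # Gradient of the log-likelihood w.r.t. the logit of class c.
        error = (1.0 if c == gold else 0.0) - probs[c]
        for f, value in feats.items():
            weights[c][f] += lr * error * value


# Toy product with two fields and two candidate categories (made-up numbers).
scores = {
    "camera": {"title": 2.0, "description": 0.5},
    "book": {"title": 0.2, "description": 0.4},
}
weights = {c: {f: 0.0 for f in feats} for c, feats in scores.items()}
for _ in range(20):
    sgd_step(scores, "camera", weights)
print(class_probabilities(scores, weights))
print(weights)
```

In this sketch the learned `weights[c][f]` play the role the abstract describes for field-feature weights: a larger value means field `f` counts for more when deciding whether the document belongs to class `c`. The focus-word weighting of part (2) would, under this assumption, change how the per-field scores are computed rather than the form of the model.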



Published In

CIKM '05: Proceedings of the 14th ACM international conference on Information and knowledge management
October 2005
854 pages
ISBN: 1595931406
DOI: 10.1145/1099554

Publisher

Association for Computing Machinery
New York, NY, United States


Author Tags

1. log-linear model
2. natural language processing
3. text categorization

Qualifiers

• Article

Conference

CIKM05: Conference on Information and Knowledge Management
October 31 - November 5, 2005
Bremen, Germany

Acceptance Rates

CIKM '05 Paper Acceptance Rate: 77 of 425 submissions, 18%
Overall Acceptance Rate: 1,861 of 8,427 submissions, 22%
