DOI: 10.1145/1008992.1009036
Article

Parameterized generation of labeled datasets for text categorization based on a hierarchical directory

Published: 25 July 2004

    Abstract

    Although text categorization is a burgeoning area of IR research, readily available test collections in this field are surprisingly scarce. We describe a methodology and system (named ACCIO) for automatically acquiring labeled datasets for text categorization from the World Wide Web by capitalizing on the body of knowledge encoded in the structure of existing hierarchical directories such as the Open Directory. We define parameters of categories that make it possible to acquire numerous datasets with desired properties, which in turn allow better control over categorization experiments. In particular, we develop metrics that estimate the difficulty of a dataset by examining the host directory structure. These metrics are shown to be good predictors of the categorization accuracy that can be achieved on a dataset, and they serve as efficient heuristics for generating datasets subject to user requirements. A large collection of automatically generated datasets is made available for other researchers to use.
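
The abstract does not spell out the difficulty metrics, so the following is only an illustrative sketch of what a structure-based metric could look like. It assumes that categories are written as slash-separated paths in a single tree (for example `Top/Science/Physics`) and uses the number of edges between two categories as a rough proxy for how easy they are to tell apart. The function names `graph_distance` and `pairs_in_range`, the path format, and the distance thresholds are hypothetical and are not taken from ACCIO.

```python
from itertools import combinations


def graph_distance(cat_a: str, cat_b: str) -> int:
    """Count the edges on the tree path between two category paths.

    Assumption: both paths belong to the same tree and share a root.
    """
    a = cat_a.strip("/").split("/")
    b = cat_b.strip("/").split("/")
    # Length of the shared prefix, i.e. depth of the lowest common ancestor.
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    # Steps up from cat_a to the common ancestor, then down to cat_b.
    return (len(a) - common) + (len(b) - common)


def pairs_in_range(categories, min_dist, max_dist):
    """Return category pairs whose distance lies in [min_dist, max_dist]."""
    return [
        (a, b)
        for a, b in combinations(categories, 2)
        if min_dist <= graph_distance(a, b) <= max_dist
    ]


# Hypothetical category paths, used only for illustration.
cats = [
    "Top/Science/Physics",
    "Top/Science/Chemistry",
    "Top/Arts/Music/Jazz",
]
close_pairs = pairs_in_range(cats, 1, 3)  # siblings: presumably harder to separate
far_pairs = pairs_in_range(cats, 4, 6)    # distant branches: presumably easier
```

Under this assumption, a user who wants a harder dataset would ask for pairs with a small distance, and one who wants an easier dataset would ask for a large distance. Whether such a distance actually predicts accuracy is the kind of claim the paper evaluates; the sketch does not establish it.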


          Published In

          SIGIR '04: Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
          July 2004
          624 pages
          ISBN:1581138814
          DOI:10.1145/1008992
          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Qualifiers

          • Article

          Conference

          SIGIR04
          Acceptance Rates

          Overall Acceptance Rate 792 of 3,983 submissions, 20%
