Article

Parameterized generation of labeled datasets for text categorization based on a hierarchical directory

Authors: Dmitry Davidov, Evgeniy Gabrilovich, Shaul Markovitch

SIGIR '04: Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval

Pages 250 - 257

https://doi.org/10.1145/1008992.1009036

Published: 25 July 2004

Abstract

Although text categorization is a burgeoning area of IR research, readily available test collections in this field are surprisingly scarce. We describe a methodology and system (named ACCIO) for automatically acquiring labeled datasets for text categorization from the World Wide Web, by capitalizing on the body of knowledge encoded in the structure of existing hierarchical directories such as the Open Directory. We define parameters of categories that make it possible to acquire numerous datasets with desired properties, which in turn allow better control over categorization experiments. In particular, we develop metrics that estimate the difficulty of a dataset by examining the host directory structure. These metrics are shown to be good predictors of the categorization accuracy that can be achieved on a dataset, and serve as efficient heuristics for generating datasets subject to users' requirements. A large collection of automatically generated datasets is made available for other researchers to use.
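
As a rough illustration of what a structure-based difficulty estimate might look like, the sketch below computes an edge-count distance between two categories in a toy hierarchy. The category paths, the `PARENT` map, and the function names are hypothetical, and edge counting is only one plausible structural measure; the abstract does not specify which metrics ACCIO actually uses.

```python
# Hypothetical toy hierarchy: each category maps to its parent (the root maps to None).
# These paths are illustrative only and are not taken from the Open Directory.
PARENT = {
    "Top": None,
    "Top/Science": "Top",
    "Top/Science/Physics": "Top/Science",
    "Top/Science/Chemistry": "Top/Science",
    "Top/Arts": "Top",
    "Top/Arts/Music": "Top/Arts",
}


def ancestors(category, parent=PARENT):
    """Return the chain from the category up to the root, inclusive."""
    chain = []
    while category is not None:
        chain.append(category)
        category = parent[category]
    return chain


def edge_distance(a, b, parent=PARENT):
    """Count the edges on the tree path between categories a and b."""
    # Steps needed to reach each ancestor of a, starting from a itself.
    steps_from_a = {node: steps for steps, node in enumerate(ancestors(a, parent))}
    # Walk up from b; the first node also above a is the lowest common ancestor.
    for steps_from_b, node in enumerate(ancestors(b, parent)):
        if node in steps_from_a:
            return steps_from_a[node] + steps_from_b
    raise ValueError("categories do not share a root")


if __name__ == "__main__":
    print(edge_distance("Top/Science/Physics", "Top/Science/Chemistry"))
    print(edge_distance("Top/Science/Physics", "Top/Arts/Music"))
```

Under this assumed measure, sibling categories sit closer together than categories from different top-level branches, so a dataset built from a close pair would plausibly be harder to separate than one built from a distant pair.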

Index Terms

Parameterized generation of labeled datasets for text categorization based on a hierarchical directory
1. Information systems
  1. Information retrieval
    1. Evaluation of retrieval results
    2. Retrieval tasks and goals
      1. Document filtering
      2. Information extraction

Published In

SIGIR '04: Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval

July 2004

624 pages

ISBN: 1581138814

DOI: 10.1145/1008992

General Chair: Mark Sanderson, University of Sheffield (UK)

Program Chairs: Kalervo Järvelin, University of Tampere (Finland); James Allan, University of Massachusetts (USA); Peter Bruza, Distributed Systems Technology Centre (Australia)

Copyright © 2004 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Article

Conference

SIGIR04

Sponsor:

SIGIR04: The 27th ACM/SIGIR International Symposium on Information Retrieval 2004

July 25 - 29, 2004

Sheffield, United Kingdom

Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%
