research-article

Scaling up text classification for large file systems

Authors: George Forman and Shyamsundar Rajaram

KDD '08: Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining

August 2008

Pages 239 - 246

https://doi.org/10.1145/1401890.1401923

Published: 24 August 2008

Abstract

We combine the speed and scalability of information retrieval with the generally superior classification accuracy offered by machine learning, yielding a two-phase text classifier that can scale to very large document corpora. We investigate the effect of different methods of formulating the query from the training set, as well as varying the query size. In empirical tests on the Reuters RCV1 corpus of 806,000 documents, we find runtime was easily reduced by a factor of 27x, with a somewhat surprising gain in F-measure compared with traditional text classification.
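The two-phase idea in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the authors' implementation: the whitespace tokenizer, the in-memory inverted index, the term-scoring rule in `formulate_query`, the toy documents, and the keyword test standing in for a trained classifier. The `query_size` parameter corresponds to the query size the abstract says was varied.

```python
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase whitespace tokenizer (a deliberately crude stand-in)."""
    return text.lower().split()


def build_index(docs):
    """Build an inverted index mapping each term to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in set(tokenize(text)):
            index[term].add(doc_id)
    return index


def formulate_query(positives, negatives, query_size):
    """Return the query_size terms most skewed toward the positive training set.

    Hypothetical scoring rule: fraction of positive documents containing the
    term minus the fraction of negative documents containing it.
    """
    pos_df = Counter(t for text in positives for t in set(tokenize(text)))
    neg_df = Counter(t for text in negatives for t in set(tokenize(text)))

    def score(term):
        return pos_df[term] / len(positives) - neg_df[term] / max(len(negatives), 1)

    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(pos_df, key=lambda t: (-score(t), t))
    return ranked[:query_size]


def two_phase_classify(docs, index, query_terms, classifier):
    """Phase 1 retrieves candidates via the index; phase 2 classifies only those."""
    candidates = set()
    for term in query_terms:
        candidates |= index.get(term, set())  # OR-query over the selected terms
    accepted = {doc_id for doc_id in candidates if classifier(docs[doc_id])}
    return accepted, len(candidates)


# Toy corpus and training examples, invented for illustration only.
docs = {
    1: "oil prices rise crude",
    2: "football match result",
    3: "crude oil exports fall",
    4: "election results announced",
    5: "cooking oil recipe",
}
index = build_index(docs)
query = formulate_query(
    positives=["crude oil market", "oil barrel prices"],
    negatives=["football scores", "election news"],
    query_size=2,
)
# The lambda is a placeholder for a trained classifier such as a linear SVM.
accepted, n_candidates = two_phase_classify(
    docs, index, query, classifier=lambda text: "crude" in tokenize(text)
)
print(query, sorted(accepted), n_candidates)
```

In this sketch the classifier runs only on the documents returned by the index query, so the expensive second phase touches a fraction of the corpus. A larger `query_size` retrieves more candidates, trading speed for recall.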

Cited By

Wajeed M, Adilakshmi T (2012). Supervised and semi-supervised learning in text classification using enhanced KNN algorithm. International Journal of Intelligent Systems Technologies and Applications 11:3/4 (179-195). Online publication date: 1-Mar-2012.
https://dl.acm.org/doi/10.5555/2448056.2448058

He M, Qin T, Zeng S, Ren C, Yuan L (2012). Automate back office activity monitoring to drive operational excellence. Proceedings of the 10th international conference on Service-Oriented Computing (688-702). Online publication date: 12-Nov-2012.
https://dl.acm.org/doi/10.1007/978-3-642-34321-6_55

Xu Z, Hogan C, Bauer R (2009). Greedy is not Enough. Proceedings of the 2009 IEEE International Conference on Data Mining Workshops (326-331). Online publication date: 6-Dec-2009.
https://dl.acm.org/doi/10.1109/ICDMW.2009.38

Forman G, Kirshenbaum E, Shanahan J, Amer-Yahia S, Manolescu I, Zhang Y, Evans D, Kolcz A, Choi K, Chowdury A (2008). Extremely fast text feature extraction for classification and indexing. Proceedings of the 17th ACM conference on Information and knowledge management (1221-1230). Online publication date: 26-Oct-2008.
https://dl.acm.org/doi/10.1145/1458082.1458243

Index Terms

Scaling up text classification for large file systems
1. Computing methodologies
  1. Machine learning
2. Information systems
  1. Information retrieval

Reviews

Reviewer: Tania Bedrax-Weiss

Machine learning has been applied in the domain of information retrieval with great success, as witnessed in Web search engines such as Google. In such domains, the number of search results is so large that the objective is to obtain high precision in the top search results. However, there are smaller document repositories where the objective is both high precision and high recall. In the larger of these repositories, machine learning can still take a long time to find the search results, so optimization techniques are necessary.

This paper explains one such optimization technique, which relies on narrowing the search space to provide significant speedups. More specifically, Forman and Rajaram formulate a two-phase classification: the first phase uses the index to query the repository and select a small subset of relevant documents, and the second phase classifies the documents retrieved. The authors describe in detail the design choices made for each phase and how some of those choices perform relative to others. They show that the two-phase approach yields significant savings over full classification, and conclude that the savings are more pronounced in the more difficult cases, namely large queries. The intuition is that the two-phase classifier balances query time against the time spent fetching each document and analyzing it (extracting features and classifying).

Although this paper provides an interesting read as it dives into text classification for information retrieval, there are optimizations that it does not address. For example, if the corpus does not change often, one could use offline classification for documents that essentially do not change or change very little. Still, Forman and Rajaram discuss possible extensions of their approach that seem promising, and anyone interested in this subject area should feel encouraged to explore.

Online Computing Reviews Service
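The trade-off the review describes, between query time and per-document work, can be made concrete with a rough cost model. This is a back-of-the-envelope sketch, not the paper's analysis: the function names are hypothetical, the model assumes a fixed per-document cost for fetching, feature extraction, and classification, and the numbers in the example are invented rather than taken from the paper's measurements.

```python
def full_scan_cost(n_docs, per_doc_cost):
    """Cost of fetching, featurizing, and classifying every document."""
    return n_docs * per_doc_cost


def two_phase_cost(query_cost, n_retrieved, per_doc_cost):
    """Cost of one index query plus per-document work on the retrieved subset only."""
    return query_cost + n_retrieved * per_doc_cost


def speedup(n_docs, n_retrieved, query_cost, per_doc_cost):
    """Ratio of full-scan cost to two-phase cost under this simple model."""
    return full_scan_cost(n_docs, per_doc_cost) / two_phase_cost(
        query_cost, n_retrieved, per_doc_cost
    )


# Invented numbers: 1,000,000 documents, 30,000 retrieved by the query,
# a 2,000 ms query, and 1 ms of fetch/extract/classify work per document.
example = speedup(n_docs=1_000_000, n_retrieved=30_000, query_cost=2_000.0, per_doc_cost=1.0)
print(example)
```

Under this model the speedup approaches `n_docs / n_retrieved` when the query is cheap, and shrinks as the query grows more expensive or retrieves more candidates, which is the balancing act the reviewer points to.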

Published In

KDD '08: Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining

August 2008

1116 pages

ISBN:9781605581934

DOI:10.1145/1401890

General Chair: Ying Li (Microsoft adCenter Labs)

Program Chairs: Bing Liu (University of Illinois at Chicago), Sunita Sarawagi (Indian Institute of Technology, Bombay)

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

KDD08: The 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

August 24 - 27, 2008

Las Vegas, Nevada, USA

Acceptance Rates

KDD '08 Paper Acceptance Rate 118 of 593 submissions, 20%;

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%
