DOI: 10.1145/1401890.1401923

Scaling up text classification for large file systems

Published: 24 August 2008

    Abstract

    We combine the speed and scalability of information retrieval with the generally superior classification accuracy offered by machine learning, yielding a two-phase text classifier that can scale to very large document corpora. We investigate the effect of different methods of formulating the query from the training set, as well as varying the query size. In empirical tests on the Reuters RCV1 corpus of 806,000 documents, we find that runtime was easily reduced by a factor of 27, with a somewhat surprising gain in F-measure compared with traditional text classification.
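
    The two-phase idea can be illustrated with a minimal sketch. It is not the authors' implementation: the in-memory inverted index, the document-frequency-difference scoring in `formulate_query`, and the keyword-weight `toy_classifier` are all assumptions chosen to keep the example self-contained. Phase one uses an OR-query on the index to retrieve candidates; phase two runs the (more expensive) classifier only on those candidates.

```python
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase whitespace tokenizer (a simplification for this sketch)."""
    return text.lower().split()


def build_index(docs):
    """Build an inverted index mapping each term to the set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in set(tokenize(text)):
            index[term].add(doc_id)
    return index


def formulate_query(positives, negatives, k):
    """Pick k query terms from the training set.

    Hypothetical scoring: fraction of positive documents containing the
    term minus the fraction of negative documents containing it.
    """
    pos_df = Counter(t for d in positives for t in set(tokenize(d)))
    neg_df = Counter(t for d in negatives for t in set(tokenize(d)))

    def score(term):
        return pos_df[term] / len(positives) - neg_df[term] / len(negatives)

    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(pos_df, key=lambda t: (-score(t), t))[:k]


def toy_classifier(text):
    """Stand-in for a trained model: a hand-set linear keyword scorer."""
    weights = {"wheat": 1.0, "grain": 1.0, "harvest": 1.0, "export": 1.0,
               "prices": 0.5, "beer": -2.0, "festival": -1.0}
    return sum(weights.get(t, 0.0) for t in tokenize(text)) > 0


def two_phase_classify(index, docs, query_terms, classifier):
    """Phase 1: OR-query the index. Phase 2: classify only the candidates."""
    candidates = set()
    for term in query_terms:
        candidates |= index.get(term, set())
    return {doc_id for doc_id in candidates if classifier(docs[doc_id])}
```

    In this sketch the query size `k` controls the trade-off: a larger query retrieves more candidates (helping recall) but leaves more documents for the second phase to fetch and classify.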

    Reviews

    Tania Bedrax-Weiss

    Machine learning has been applied in the domain of information retrieval with great success, as witnessed in Web search engines such as Google. In such domains, the number of search results is so large that the objective is to obtain high precision in the top search results. However, there are smaller document repositories where the objective is both high precision and high recall. In the larger of these repositories, machine learning can still take a long time to find the search results, so optimization techniques are necessary. This paper explains one such optimization technique that relies on narrowing the search space to provide significant speedups.

    More specifically, Forman and Rajaram formulate a two-phase classification: the first phase uses the index to query the repository and select a small subset of relevant documents, and the second phase classifies the documents retrieved. The authors describe in detail the design choices made for each phase and how some of those choices perform relative to others. They show that the two-phase approach yields significant savings over full classification, and conclude that the savings are more pronounced in the more difficult cases, namely large queries. The intuition is that the two-phase classifier balances query time against the cost of fetching each document and analyzing it (extracting features and classifying).

    Although this paper provides an interesting read as it dives into text classification for information retrieval, there are optimizations that it does not address. For example, if the corpus does not change often, one could use offline classification for documents that do not change or change very little. Still, Forman and Rajaram discuss possible extensions of their approach that seem promising, and anyone interested in this subject area should feel encouraged to explore.

    Online Computing Reviews Service
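
    The balancing act the review describes can be made concrete with a back-of-the-envelope cost model. This is only a sketch under stated assumptions: the function names, the linear cost model, and any numbers plugged into it are hypothetical and are not taken from the paper.

```python
def full_scan_cost(n_docs, per_doc_cost):
    """Cost of fetching, featurizing, and classifying every document."""
    return n_docs * per_doc_cost


def two_phase_cost(query_cost, n_candidates, per_doc_cost):
    """Cost of one index query plus per-document work on the candidates only."""
    return query_cost + n_candidates * per_doc_cost


def speedup(n_docs, n_candidates, per_doc_cost, query_cost):
    """Ratio of full-scan cost to two-phase cost under this linear model."""
    return full_scan_cost(n_docs, per_doc_cost) / two_phase_cost(
        query_cost, n_candidates, per_doc_cost
    )
```

    Under this model, the speedup approaches `n_docs / n_candidates` when the query is cheap and shrinks as the query itself grows more expensive or retrieves more candidates.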

    Published In

    KDD '08: Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
    August 2008
    1116 pages
    ISBN:9781605581934
    DOI:10.1145/1401890
    General Chair: Ying Li
    Program Chairs: Bing Liu, Sunita Sarawagi
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. document categorization
    2. enterprise scalability
    3. forensic search
    4. information retrieval
    5. machine learning
    6. text classification

    Qualifiers

    • Research-article

    Conference

    KDD08

    Acceptance Rates

    KDD '08 Paper Acceptance Rate 118 of 593 submissions, 20%;
    Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

    Cited By

    • (2012) Supervised and semi-supervised learning in text classification using enhanced KNN algorithm. International Journal of Intelligent Systems Technologies and Applications 11:3/4, 179-195. DOI: 10.5555/2448056.2448058. Online publication date: 1-Mar-2012
    • (2012) Automate back office activity monitoring to drive operational excellence. Proceedings of the 10th international conference on Service-Oriented Computing, 688-702. DOI: 10.1007/978-3-642-34321-6_55. Online publication date: 12-Nov-2012
    • (2009) Greedy is not Enough. Proceedings of the 2009 IEEE International Conference on Data Mining Workshops, 326-331. DOI: 10.1109/ICDMW.2009.38. Online publication date: 6-Dec-2009
    • (2008) Extremely fast text feature extraction for classification and indexing. Proceedings of the 17th ACM conference on Information and knowledge management, 1221-1230. DOI: 10.1145/1458082.1458243. Online publication date: 26-Oct-2008
