DOI: 10.1145/1835449.1835502
research-article

Temporally-aware algorithms for document classification

Published: 19 July 2010

Abstract

Automatic Document Classification (ADC) is still one of the major information retrieval problems. It usually employs a supervised learning strategy, in which we first build a classification model from pre-classified documents and then use this model to classify unseen documents. Most supervised algorithms assume that all documents provide equally important information. In practice, however, a document may be more or less important for building the classification model depending on several factors, such as its timeliness, the venue in which it was published, and its authors. In this paper, we are particularly concerned with the impact that temporal effects may have on ADC and with how to minimize that impact. To deal with these effects, we introduce a temporal weighting function (TWF) and propose a methodology to determine it for document collections. We applied the proposed methodology to ACM-DL and Medline and found that the TWF of both collections follows a lognormal distribution. We then extend three ADC algorithms (namely kNN, Rocchio and Naïve Bayes) to incorporate the TWF. Experiments showed that the temporally-aware classifiers achieved significant gains, outperforming (or at least matching) state-of-the-art algorithms.
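
The abstract does not give the exact form of the TWF or say how it enters each classifier, so the following is only a minimal sketch of the general idea for kNN, under stated assumptions. It assumes that the temporal distance between a training document and the test document is measured in years, that the weight is a lognormal density evaluated at that distance plus a positive shift, and that each neighbor's cosine-similarity vote is multiplied by this weight. The function names and the parameters `mu`, `sigma`, and `shift` are hypothetical and are not taken from the paper.

```python
import math
from collections import defaultdict


def lognormal_twf(delta, mu=0.0, sigma=1.0, shift=1.0):
    """Hypothetical temporal weight: lognormal density at |delta| + shift.

    The shift keeps the argument strictly positive, since the lognormal
    density is only defined for x > 0. All parameters are assumptions.
    """
    x = abs(delta) + shift
    exponent = -((math.log(x) - mu) ** 2) / (2 * sigma ** 2)
    return math.exp(exponent) / (x * sigma * math.sqrt(2 * math.pi))


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def temporal_knn(train, query_vec, query_year, k=3, **twf_params):
    """Classify a query with kNN, scaling each neighbor's vote by the TWF.

    `train` is a list of (term_vector, label, year) tuples.
    """
    scored = [(cosine(vec, query_vec), label, year) for vec, label, year in train]
    neighbors = sorted(scored, key=lambda s: s[0], reverse=True)[:k]
    votes = defaultdict(float)
    for sim, label, year in neighbors:
        votes[label] += sim * lognormal_twf(year - query_year, **twf_params)
    return max(votes, key=votes.get)


# Toy data: two old documents of class "A" and one recent document of class "B".
train = [
    ({"neural": 1.0, "network": 1.0}, "A", 1990),
    ({"neural": 1.0, "network": 1.0}, "A", 1991),
    ({"neural": 1.0}, "B", 2000),
]
query = {"neural": 1.0, "network": 1.0}
predicted = temporal_knn(train, query, query_year=2000, k=3)
```

In this toy data, an unweighted similarity vote would favor class "A" (two exact matches), whereas the temporally weighted vote favors "B" because the two "A" documents are far from the query year. With the default parameters the assumed weight decreases monotonically as the temporal distance grows; parameters fitted to a real collection could produce a different shape.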

      Published In

      SIGIR '10: Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
      July 2010
      944 pages
      ISBN:9781450301534
      DOI:10.1145/1835449
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. classification and clustering
      2. text mining

      Qualifiers

      • Research-article

      Conference

      SIGIR '10

      Acceptance Rates

      SIGIR '10 Paper Acceptance Rate 87 of 520 submissions, 17%;
      Overall Acceptance Rate 792 of 3,983 submissions, 20%

      Bibliometrics & Citations

      Article Metrics

      • Downloads (last 12 months): 9
      • Downloads (last 6 weeks): 2
      Reflects downloads up to 20 Feb 2025

      Cited By

      • (2024) A New Natural Language Processing–Inspired Methodology (Detection, Initial Characterization, and Semantic Characterization) to Investigate Temporal Shifts (Drifts) in Health Care Data: Quantitative Study. JMIR Medical Informatics, 12, e54246. DOI: 10.2196/54246. Online publication date: 28-Oct-2024
      • (2024) A network-driven study of hyperprolific authors in computer science. Scientometrics, 129(4), 2255-2283. DOI: 10.1007/s11192-024-04940-5. Online publication date: 1-Apr-2024
      • (2021) RETRACTED ARTICLE: A swarm-optimized tree-based association rule approach for classifying semi-structured data using soft computing approach. Soft Computing - A Fusion of Foundations, Methodologies and Applications, 25(20), 12745-12758. DOI: 10.1007/s00500-021-06158-6. Online publication date: 1-Oct-2021
      • (2017) A multi-one-class dynamic classifier for adaptive digitization of document streams. International Journal on Document Analysis and Recognition, 20(3), 137-154. DOI: 10.1007/s10032-017-0286-6. Online publication date: 1-Sep-2017
      • (2016) Machine learning approach to recognize subject based sentiment values of reviews. 2016 Moratuwa Engineering Research Conference (MERCon), 6-11. DOI: 10.1109/MERCon.2016.7480107. Online publication date: Apr-2016
      • (2016) Smoothing Temporal Difference for Text Categorization. Information Retrieval Technology, 203-214. DOI: 10.1007/978-3-319-28940-3_16. Online publication date: 22-Jan-2016
      • (2016) A quantitative analysis of the temporal effects on automatic text classification. Journal of the Association for Information Science and Technology, 67(7), 1639-1667. DOI: 10.1002/asi.23452. Online publication date: 1-Jul-2016
      • (2015) G-KNN. Proceedings of the 30th Annual ACM Symposium on Applied Computing, 1335-1338. DOI: 10.1145/2695664.2695967. Online publication date: 13-Apr-2015
      • (2014) Mining text and social streams. ACM SIGKDD Explorations Newsletter, 15(2), 9-19. DOI: 10.1145/2641190.2641194. Online publication date: 16-Jun-2014
      • (2014) Learning temporal-dependent ranking models. Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval, 757-766. DOI: 10.1145/2600428.2609619. Online publication date: 3-Jul-2014
