research-article

Temporally-aware algorithms for document classification

Authors:

Leonardo Rocha,

Gisele L. Pappa,

Fernando Mourão,

Wagner Meira, Jr.,

Marcos Gonçalves

SIGIR '10: Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval

Pages 307 - 314

https://doi.org/10.1145/1835449.1835502

Published: 19 July 2010

Abstract

Automatic Document Classification (ADC) remains one of the major problems in information retrieval. It usually employs a supervised learning strategy: a classification model is first built from pre-classified documents and then used to classify unseen documents. Most supervised algorithms assume that all documents provide equally important information. In practice, however, a document may be more or less important for building the classification model depending on several factors, such as its timeliness, the venue in which it was published, and its authors. In this paper, we are particularly concerned with the impact that temporal effects may have on ADC and with how to minimize that impact. To deal with these effects, we introduce a temporal weighting function (TWF) and propose a methodology to determine it for document collections. We applied the proposed methodology to ACM-DL and Medline and found that the TWF of both collections follows a lognormal distribution. We then extended three ADC algorithms (namely kNN, Rocchio and Naïve Bayes) to incorporate the TWF. Experiments showed that the temporally-aware classifiers achieved significant gains, outperforming (or at least matching) state-of-the-art algorithms.
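To make the idea of a temporally-aware classifier concrete, the following is a minimal sketch of how a lognormal-shaped weight could be folded into a kNN vote. The abstract gives neither the TWF's parameters nor the exact way it enters each classifier, so everything here is an assumption for illustration: the names `lognormal_twf` and `temporal_knn`, the parameters `mu`, `sigma` and `shift`, the use of the absolute difference in years as temporal distance, and the choice to multiply cosine similarity by the weight. It should not be read as the authors' implementation.

```python
import math
from collections import defaultdict


def lognormal_twf(delta, mu=0.0, sigma=1.0, shift=1.0):
    """Hypothetical TWF: lognormal density at a shifted temporal distance.

    mu, sigma and shift are illustrative values, not fitted parameters.
    """
    x = delta + shift
    if x <= 0:
        return 0.0
    exponent = -((math.log(x) - mu) ** 2) / (2 * sigma ** 2)
    return math.exp(exponent) / (x * sigma * math.sqrt(2 * math.pi))


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def temporal_knn(test_vec, test_year, training, k=3, twf=lognormal_twf):
    """Classify test_vec by a kNN vote whose scores are scaled by the TWF.

    training is a list of (term_vector, label, year) tuples.
    """
    scored = []
    for vec, label, year in training:
        delta = abs(test_year - year)  # assumed notion of temporal distance
        scored.append((cosine(test_vec, vec) * twf(delta), label))
    scored.sort(key=lambda pair: pair[0], reverse=True)

    votes = defaultdict(float)
    for score, label in scored[:k]:
        votes[label] += score
    return max(votes, key=votes.get)


# Toy collection (invented for this sketch): the same terms belong to
# different classes depending on when the document was created.
training = [
    ({"neural": 1.0, "network": 1.0}, "AI", 2009),
    ({"neural": 1.0, "network": 1.0}, "Biology", 1985),
    ({"neural": 1.0, "cell": 1.0}, "Biology", 1986),
]
query = {"neural": 1.0, "network": 1.0}

print(temporal_knn(query, 2010, training))                      # TWF applied
print(temporal_knn(query, 2010, training, twf=lambda d: 1.0))   # no weighting
```

In this toy setting the two older documents outvote the recent one when every training document counts equally, while the weighted vote favors the document created closest in time to the query. Passing the weighting function as a parameter keeps the sketch independent of any particular fitted distribution.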

Index Terms

Temporally-aware algorithms for document classification
1. Applied computing
  1. Document management and text processing
2. Information systems
  1. Information retrieval

Published In

SIGIR '10: Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval

July 2010

944 pages

ISBN:9781450301534

DOI:10.1145/1835449

General Chairs: Fabio Crestani (University of Lugano, CH), Stéphane Marchand-Maillet (University of Geneva, CH)

Program Chairs: Hsin-Hsi Chen (National Taiwan University, TW), Efthimis N. Efthimiadis (University of Washington, USA), Jacques Savoy (University of Neuchatel, CH)

Copyright © 2010 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGIR: ACM Special Interest Group on Information Retrieval

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

SIGIR '10

Sponsor:

SIGIR

SIGIR '10: The 33rd International ACM SIGIR conference on research and development in Information Retrieval

July 19 - 23, 2010

Geneva, Switzerland

Acceptance Rates

SIGIR '10 Paper Acceptance Rate 87 of 520 submissions, 17%;

Overall Acceptance Rate 792 of 3,983 submissions, 20%

Bibliometrics

Total Citations: 19

Total Downloads: 700

Downloads (Last 12 months): 9

Downloads (Last 6 weeks): 2

Reflects downloads up to 20 Feb 2025
