research-article

Similarity-Based Synthetic Document Representations for Meta-Feature Generation in Text Classification

Authors:

Thierson C. Rosa,

Marcos A. Gonçalves

SIGIR'19: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval

Pages 355 - 364

https://doi.org/10.1145/3331184.3331239

Published: 18 July 2019

Abstract

We propose new solutions that enhance and extend the already very successful application of meta-features to text classification. Our newly proposed meta-features are capable of: (1) improving the correlation of small pieces of evidence shared by neighbors with labeled categories by means of synthetic document representations and (local and global) hyperplane distances; and (2) estimating the level of error introduced by these newly proposed meta-features and by the existing meta-features in the literature, especially for hard-to-classify regions of the feature space. Our experiments with a large and representative number of datasets show that our new solutions produce the best results in all tested scenarios, achieving gains of up to 12% over the strongest meta-feature proposal in the literature.
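As background for readers unfamiliar with the term, the sketch below illustrates one generic kind of distance-based meta-feature: for each category, the cosine similarities between a document and its k nearest labeled neighbors in that category. It is not the method proposed in this paper (the synthetic document representations, hyperplane distances, and error estimates are not reproduced here). The function name `knn_meta_features`, the parameter `k`, and the toy data are all assumptions made for illustration.

```python
import numpy as np


def knn_meta_features(X_train, y_train, x, k=2):
    """Hypothetical helper: per-class top-k cosine similarities for one document.

    X_train: (n_docs, n_terms) array of training document vectors (e.g. TF-IDF).
    y_train: (n_docs,) array of class labels.
    x:       (n_terms,) vector of the document to describe.
    Returns a vector of length k * n_classes.
    """
    # L2-normalise so that dot products equal cosine similarities.
    Xn = X_train / np.linalg.norm(X_train, axis=1, keepdims=True)
    xn = x / np.linalg.norm(x)
    sims = Xn @ xn

    feats = []
    for c in np.unique(y_train):
        # Similarities to training documents of class c, highest first.
        top = np.sort(sims[y_train == c])[::-1][:k]
        # Pad with zeros when the class has fewer than k documents.
        padded = np.zeros(k)
        padded[: len(top)] = top
        feats.append(padded)
    return np.concatenate(feats)


# Toy data (illustrative only): three documents, two terms, two classes.
X_toy = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y_toy = np.array([0, 1, 0])
meta = knn_meta_features(X_toy, y_toy, np.array([1.0, 0.0]), k=2)
print(meta)
```

The resulting vector has k entries per class, so its size depends on the number of classes rather than on the vocabulary; a second-stage classifier would then be trained on such vectors instead of the original bag-of-words features.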

Supplementary Material

MP4 File (cite2-11h20-d2.mp4), 489.32 MB

Published In

SIGIR'19: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval

July 2019

1512 pages

ISBN: 9781450361729

DOI: 10.1145/3331184

General Chairs:
Benjamin Piwowarski (CNRS - Sorbonne Universite, France)
Max Chevalier (Universite de Toulouse, CNRS, France)
Eric Gaussier (Universite Grenoble Alpes, CNRS, France)

Program Chairs:
Yoelle Maarek (Amazon Research, Israel)
Jian-Yun Nie (University of Montreal, Canada)
Falk Scholer (RMIT University, Australia)

Copyright © 2019 ACM.

Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Sponsors

SIGIR: ACM Special Interest Group on Information Retrieval

Publisher

Association for Computing Machinery

New York, NY, United States

Funding Sources

Fapemig
CNPq
Capes

Conference

SIGIR '19

Sponsor:

SIGIR

SIGIR '19: The 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval

July 21 - 25, 2019

Paris, France

Acceptance Rates

SIGIR'19 Paper Acceptance Rate 84 of 426 submissions, 20%;

Overall Acceptance Rate 792 of 3,983 submissions, 20%
