
Similarity-Based Synthetic Document Representations for Meta-Feature Generation in Text Classification

Published: 18 July 2019

Abstract

We propose new solutions that enhance and extend the already very successful application of meta-features to text classification. Our newly proposed meta-features are capable of: (1) improving the correlation of small pieces of evidence shared by neighbors with labeled categories by means of synthetic document representations and (local and global) hyperplane distances; and (2) estimating the level of error introduced by these newly proposed meta-features and by the existing ones in the literature, especially in hard-to-classify regions of the feature space. Our experiments with a large and representative number of datasets show that our new solutions produce the best results in all tested scenarios, achieving gains of up to 12% over the strongest meta-feature proposal in the literature.
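
The abstract does not spell out how the meta-features are built, so the following is only a minimal sketch of the general idea of neighbor-based meta-features, not the authors' construction. It assumes documents are dense term-weight vectors, uses cosine similarity, and treats the centroid of a document's k nearest neighbors within a class as a stand-in for a "synthetic document". The function names (`cosine`, `meta_features`), the parameter `k`, and the centroid choice are all illustrative assumptions; hyperplane distances and error estimation are not covered.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def meta_features(x, train_X, train_y, k=2):
    """Build a meta-feature vector for document x (illustrative assumption).

    For each class, in sorted label order, emit:
      - the k largest cosine similarities between x and that class's
        training documents (padded with 0.0 if the class has fewer than k), and
      - the cosine similarity between x and the centroid of those neighbors,
        used here as an assumed "synthetic document" for the class.
    """
    feats = []
    for label in sorted(set(train_y)):
        docs = [d for d, y in zip(train_X, train_y) if y == label]
        # k nearest neighbors of x among this class's documents
        neighbors = sorted(docs, key=lambda d: cosine(x, d), reverse=True)[:k]
        sims = [cosine(x, d) for d in neighbors]
        sims += [0.0] * (k - len(sims))
        # component-wise mean of the selected neighbors
        centroid = [sum(col) / len(neighbors) for col in zip(*neighbors)]
        feats.extend(sims)
        feats.append(cosine(x, centroid))
    return feats


if __name__ == "__main__":
    # Toy data: two classes in a two-term vocabulary (made up for illustration).
    train_X = [[1, 0], [1, 1], [0, 1], [0, 2]]
    train_y = ["a", "a", "b", "b"]
    print(meta_features([1, 0], train_X, train_y, k=2))
```

With two classes and k = 2, each document is mapped to a vector of 2 × (2 + 1) = 6 values, which a second-stage classifier could then consume in place of (or alongside) the original term weights.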

    Published In

    SIGIR'19: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval
    July 2019
    1512 pages
    ISBN: 9781450361729
    DOI: 10.1145/3331184
    Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. classification
    2. machine learning
    3. meta-features

    Qualifiers

    • Research-article

    Funding Sources

    • Fapemig
    • CNPq
    • Capes

    Conference

    SIGIR '19

    Acceptance Rates

    SIGIR'19 Paper Acceptance Rate 84 of 426 submissions, 20%;
    Overall Acceptance Rate 792 of 3,983 submissions, 20%
