Exploiting Class Labels to Boost Performance on Embedding-based Text Classification

Author:

Arkaitz Zubiaga

CIKM '20: Proceedings of the 29th ACM International Conference on Information & Knowledge Management

Pages 3357 - 3360

https://doi.org/10.1145/3340531.3417444

Published: 19 October 2020

Abstract

Text classification is one of the most frequent tasks in processing textual data and supports, among other things, research on large-scale datasets. Embeddings of different kinds have recently become the de facto standard features for text classification. These embeddings capture word meanings inferred from occurrences in large external collections. Because they are built from external collections, however, they are unaware of the distributional characteristics of words in the classification dataset at hand, most importantly the distribution of words across classes in the training data. To make the most of these embeddings as features and to boost the performance of classifiers that use them, we introduce a weighting scheme, Term Frequency-Category Ratio (TF-CR), which assigns higher weights to high-frequency, category-exclusive words when computing word embeddings. Our experiments on eight datasets show the effectiveness of TF-CR, which in most cases improves performance scores over the well-known weighting schemes TF-IDF and KLD, as well as over using no weighting scheme.
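The abstract names TF-CR but does not state its formula. As a minimal sketch, assume the weight of a word w in category c is the product of a term frequency (occurrences of w in c divided by all word occurrences in c) and a category ratio (occurrences of w in c divided by occurrences of w across all categories). The function name `tf_cr_weights` and the toy data are hypothetical and only illustrate how such a scheme would favour frequent, category-exclusive words.

```python
from collections import Counter


def tf_cr_weights(docs, labels):
    """Return {category: {word: weight}} under the ASSUMED definition
    weight(w, c) = (count of w in c / total words in c)
                 * (count of w in c / count of w in all categories).
    """
    per_class = {}       # category -> Counter of word occurrences
    overall = Counter()  # word -> occurrences across all categories
    for tokens, label in zip(docs, labels):
        per_class.setdefault(label, Counter()).update(tokens)
        overall.update(tokens)

    weights = {}
    for label, counts in per_class.items():
        total = sum(counts.values())  # all word occurrences in this category
        weights[label] = {
            word: (n / total) * (n / overall[word])
            for word, n in counts.items()
        }
    return weights


# Toy training data (hypothetical): tokenised documents and their classes.
docs = [["great", "movie"], ["great", "fun"], ["bad", "movie"]]
labels = ["pos", "pos", "neg"]
weights = tf_cr_weights(docs, labels)
```

In this toy data, "great" is both frequent in and exclusive to the "pos" class, so it receives a higher weight there than "movie", which is shared with "neg". Under the scheme described in the abstract, weights of this kind would scale each word's embedding before the embeddings are combined into features.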

Supplementary Material

MP4 File (3340531.3417444.mp4)

This video presents the CIKM 2020 poster paper titled 'Exploiting Class Labels to Boost Performance on Embedding-based Text Classification'. The work presents a novel weighting scheme, TF-CR, which is shown to improve classification performance across a number of datasets.

Cited By

Bhoi A, Balabantaray R, Sahoo D, Dhiman G, Khare M, Narducci F, Kaur A (2022). Mining social media text for disaster resource management using a feature selection based on forest optimization. Computers & Industrial Engineering 169 (108280). Online publication date: Jul-2022.
https://doi.org/10.1016/j.cie.2022.108280
Zhang H, Li Z, Xie H, Lau R, Cheng G, Li Q, Zhang D (2022). Leveraging statistical information in fine-grained financial sentiment analysis. World Wide Web 25:2 (513-531). Online publication date: 1-Mar-2022.
https://dl.acm.org/doi/10.1007/s11280-021-00993-1

Index Terms

Exploiting Class Labels to Boost Performance on Embedding-based Text Classification
1. Computing methodologies
  1. Artificial intelligence
    1. Knowledge representation and reasoning
    2. Natural language processing
2. Information systems
  1. Information retrieval
    1. Retrieval tasks and goals
      1. Clustering and classification

Information & Contributors

Information

Published In

CIKM '20: Proceedings of the 29th ACM International Conference on Information & Knowledge Management

October 2020

3619 pages

ISBN:9781450368599

DOI:10.1145/3340531

General Chairs:
Mathieu d'Aquin (DSI, Insight, NUI Galway, Ireland)
Stefan Dietze (GESIS, Cologne, Germany; Heinrich-Heine-University Düsseldorf, Germany; L3S Research Center, Germany)

Program Chairs:
Claudia Hauff (TU Delft, The Netherlands)
Edward Curry (DSI, Insight, NUI Galway, Ireland)
Philippe Cudre Mauroux (eXascale, University of Fribourg, Switzerland)

Copyright © 2020 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

CIKM '20

CIKM '20: The 29th ACM International Conference on Information and Knowledge Management

October 19 - 23, 2020

Virtual Event, Ireland

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

Bibliometrics

Total Citations: 2
Total Downloads: 141
Downloads (Last 12 months): 4
Downloads (Last 6 weeks): 0

Reflects downloads up to 08 Feb 2025
