DOI: 10.1145/3340531.3417444
research-article

Exploiting Class Labels to Boost Performance on Embedding-based Text Classification

Published: 19 October 2020 Publication History

Abstract

Text classification is one of the most frequent tasks in processing textual data, facilitating, among other things, research on large-scale datasets. Embeddings of different kinds have recently become the de facto standard features for text classification. These embeddings can capture the meanings of words as inferred from their occurrences in large external collections. Because they are built from external collections, however, they are unaware of the distributional characteristics of words in the classification dataset at hand, most importantly the distribution of words across classes in the training data. To make the most of these embeddings as features and to boost the performance of classifiers that use them, we introduce a weighting scheme, Term Frequency-Category Ratio (TF-CR), which assigns higher weights to high-frequency, category-exclusive words when computing word embeddings. Our experiments on eight datasets show the effectiveness of TF-CR, which in most cases improves performance over the well-known weighting schemes TF-IDF and KLD, as well as over using no weighting scheme.
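
The weighting idea in the abstract can be illustrated with a minimal sketch. The abstract does not state the TF-CR formula, so the definition used here is an assumption: the weight of a word in a category is taken as the product of its term frequency within that category and its category ratio (the share of the word's training-set occurrences that fall in that category). The function name tf_cr_weights and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def tf_cr_weights(docs, labels):
    """Return {category: {word: weight}} from tokenized training documents.

    Assumed definition (not stated in the abstract):
      TF(w, c) = count of w in category c / total tokens in category c
      CR(w, c) = count of w in category c / count of w in all categories
      weight   = TF(w, c) * CR(w, c)
    """
    per_cat = defaultdict(Counter)  # word counts per category
    overall = Counter()             # word counts over the whole training set
    for tokens, label in zip(docs, labels):
        per_cat[label].update(tokens)
        overall.update(tokens)

    weights = {}
    for cat, counts in per_cat.items():
        total = sum(counts.values())
        weights[cat] = {
            w: (n / total) * (n / overall[w]) for w, n in counts.items()
        }
    return weights


# Toy training set: "great" is frequent and exclusive to "pos",
# while "film" is shared between both categories.
docs = [["great", "film"], ["great", "cast"], ["dull", "film"]]
labels = ["pos", "pos", "neg"]
weights = tf_cr_weights(docs, labels)
```

Under this assumed definition, a word that is both frequent in a category and exclusive to it ("great" in "pos") receives a higher weight than a word shared across categories ("film"). Such weights could then scale each word's embedding vector before the vectors are combined into a document representation.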

Supplementary Material

MP4 File (3340531.3417444.mp4)
This video presents the CIKM 2020 poster paper titled 'Exploiting Class Labels to Boost Performance on Embedding-based Text Classification'. The work presents a novel weighting scheme, TF-CR, which is proven effective for improving classification performance across a number of datasets.

Cited By

  • (2022) Mining social media text for disaster resource management using a feature selection based on forest optimization. Computers & Industrial Engineering 169 (108280). DOI: 10.1016/j.cie.2022.108280. Online publication date: Jul-2022
  • (2022) Leveraging statistical information in fine-grained financial sentiment analysis. World Wide Web 25:2 (513-531). DOI: 10.1007/s11280-021-00993-1. Online publication date: 1-Mar-2022

Published In

CIKM '20: Proceedings of the 29th ACM International Conference on Information & Knowledge Management
October 2020
3619 pages
ISBN: 9781450368599
DOI: 10.1145/3340531

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. embeddings
  2. text classification
  3. weighting schemes

Qualifiers

  • Research-article

Conference

CIKM '20

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%
