DOI: 10.1145/3486622.3493960
research-article
Open access

On the Impact of Dataset Size: A Twitter Classification Case Study

Published: 13 April 2022

Abstract

The recent advent and evolution of deep learning models and pre-trained embedding techniques have led to breakthroughs in supervised learning. Typically, we expect that adding more labeled data improves the predictive performance of supervised models. However, collecting more labeled data is not easy because of difficulties such as manual labor costs, data privacy, and computational constraints. Hence, a comprehensive study of the relation between training set size and the classification performance of different methods could be useful when selecting a learning model for a specific task. The literature, however, lacks such a thorough and systematic study. In this paper, we concentrate on this relationship in the context of short, noisy texts from Twitter. We design a systematic mechanism to observe how the performance of supervised learning models improves as the data size increases on three well-known Twitter tasks: sentiment analysis, informativeness detection, and information relevance. In addition, we study how much better recent deep learning models are than traditional machine learning approaches at various data sizes. Our extensive experiments show that (a) recent pre-trained models have overcome big data requirements, (b) a good choice of text representation has more impact than adding more data, and (c) adding more data is not always beneficial in supervised learning.
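
The kind of measurement the abstract describes, tracking classification performance as the training set grows, can be illustrated with a minimal sketch. This is not the authors' pipeline: the synthetic "tweets", the naive Bayes classifier, and every function name below (`make_toy_tweets`, `train_nb`, `predict_nb`, `learning_curve`) are assumptions chosen only to keep the example self-contained. It trains on nested subsets of one training pool and scores each model on the same held-out test set.

```python
import math
import random
from collections import Counter


def make_toy_tweets(n, seed=0):
    """Generate synthetic labelled 'tweets' (hypothetical data, not from the paper)."""
    rng = random.Random(seed)
    pos = ["good", "great", "love", "happy"]
    neg = ["bad", "awful", "hate", "sad"]
    noise = ["the", "a", "today", "rt", "lol", "so"]
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        signal, other = (pos, neg) if label == 1 else (neg, pos)
        words = [rng.choice(signal) for _ in range(2)]
        words += [rng.choice(noise) for _ in range(3)]
        if rng.random() < 0.2:  # occasionally add a misleading word
            words.append(rng.choice(other))
        rng.shuffle(words)
        data.append((words, label))
    return data


def train_nb(examples):
    """Collect per-class word counts for a multinomial naive Bayes model."""
    counts = {0: Counter(), 1: Counter()}
    docs = Counter()
    for words, label in examples:
        counts[label].update(words)
        docs[label] += 1
    vocab = set(counts[0]) | set(counts[1])
    return counts, docs, vocab


def predict_nb(model, words):
    """Return the class with the highest add-one-smoothed log probability."""
    counts, docs, vocab = model
    total_docs = docs[0] + docs[1]
    best_label, best_score = 0, -math.inf
    for label in (0, 1):
        score = math.log((docs[label] + 1) / (total_docs + 2))
        denom = sum(counts[label].values()) + len(vocab) + 1
        for w in words:
            score += math.log((counts[label][w] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def learning_curve(train, test, sizes):
    """Train on the first n examples for each n and score on a fixed test set."""
    curve = []
    for n in sizes:
        model = train_nb(train[:n])
        correct = sum(predict_nb(model, w) == y for w, y in test)
        curve.append((n, correct / len(test)))
    return curve


train = make_toy_tweets(400, seed=1)
test = make_toy_tweets(200, seed=2)
for n, acc in learning_curve(train, test, [10, 50, 100, 400]):
    print(f"{n:4d} training tweets -> accuracy {acc:.3f}")
```

Using nested subsets and a fixed test set means that differences between points on the curve reflect the added training data rather than a change of evaluation sample. On real tweets the curve need not rise monotonically, which is consistent with finding (c) above.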

Cited By

  • (2024) The Effect of Training Data Size on Disaster Classification from Twitter. Information 15:7 (393). DOI: 10.3390/info15070393. Online publication date: 8-Jul-2024
  • (2024) Sentiment Analysis of TikTok Comments on Indonesian Presidential Elections Using IndoBERT. 2024 3rd International Conference on Creative Communication and Innovative Technology (ICCIT), 1-7. DOI: 10.1109/ICCIT62134.2024.10701256. Online publication date: 7-Aug-2024

Published In

WI-IAT '21: IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology
December 2021
698 pages
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Twitter classification
  2. dataset size
  3. empirical study
  4. extrapolation methods
  5. machine learning
  6. neural network

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

  • DFG Grant Managed Forgetting
  • European Union's Horizon 2020 research and innovation program - ROXANNE
  • European Union's Horizon 2020 research and innovation program - MIRROR

Conference

WI-IAT '21: IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology
December 14 - 17, 2021
Melbourne, VIC, Australia

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 106
  • Downloads (Last 6 weeks): 25
Reflects downloads up to 03 Feb 2025
