On the Impact of Dataset Size: A Twitter Classification Case Study

Authors:

Thi Huyen Nguyen,

Hoang H. Nguyen,

Tuan-Anh Hoang,

Thanh-Nam Doan

WI-IAT '21: IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology

Pages 210 - 217

https://doi.org/10.1145/3486622.3493960

Published: 13 April 2022

Abstract

The recent advent and evolution of deep learning models and pre-trained embedding techniques have created a breakthrough in supervised learning. Typically, we expect that adding more labeled data improves the predictive performance of supervised models. However, collecting more labeled data is not easy because of manual labor costs, data privacy, and computational constraints. Hence, a comprehensive study of the relation between training set size and the classification performance of different methods would be useful when selecting a learning model for a specific task. The literature, however, lacks such a thorough and systematic study. In this paper, we examine this relationship in the context of short, noisy texts from Twitter. We design a systematic mechanism to observe how the performance of supervised learning models improves as the data size increases on three well-known Twitter tasks: sentiment analysis, informativeness detection, and information relevance. We also study how much better recent deep learning models are than traditional machine learning approaches at various data sizes. Our extensive experiments show that (a) recent pre-trained models have overcome big data requirements, (b) a good choice of text representation has more impact than adding more data, and (c) adding more data is not always beneficial in supervised learning.
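The kind of experiment the abstract describes, measuring classification performance as the training set grows, can be illustrated with a small learning-curve sketch. This is not the authors' pipeline, models, or data: the toy "tweets", the word lists, the Naive Bayes classifier, and the subset sizes are all hypothetical choices made for illustration. The sketch trains on nested subsets of increasing size and evaluates each model on the same fixed test set.

```python
import math
import random
from collections import Counter


def make_toy_tweets(n, rng):
    """Generate hypothetical labeled short texts (stand-ins for real tweets)."""
    pos = ["good", "great", "love", "happy"]
    neg = ["bad", "awful", "hate", "sad"]
    noise = ["the", "a", "today", "rt", "lol", "so"]
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        right, wrong = (pos, neg) if label == 1 else (neg, pos)
        words = [rng.choice(noise) for _ in range(4)]
        # Two signal words; each comes from the wrong class 20% of the time.
        for _ in range(2):
            source = right if rng.random() < 0.8 else wrong
            words.append(rng.choice(source))
        data.append((words, label))
    return data


def train_nb(train):
    """Collect the counts needed for a multinomial Naive Bayes model."""
    word_counts = {0: Counter(), 1: Counter()}
    class_counts = Counter()
    for words, label in train:
        class_counts[label] += 1
        word_counts[label].update(words)
    vocab = set(word_counts[0]) | set(word_counts[1])
    return word_counts, class_counts, vocab


def predict_nb(model, words):
    """Return the class with the highest add-one-smoothed log score."""
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    best, best_score = None, -math.inf
    for c in (0, 1):
        # Smoothed prior, so a class missing from a tiny subset is still scored.
        score = math.log((class_counts[c] + 1) / (total + 2))
        denom = sum(word_counts[c].values()) + len(vocab) + 1
        for w in words:
            score += math.log((word_counts[c][w] + 1) / denom)
        if score > best_score:
            best, best_score = c, score
    return best


def learning_curve(train, test, sizes):
    """Accuracy on a fixed test set for nested training subsets of growing size."""
    curve = []
    for n in sizes:
        model = train_nb(train[:n])
        correct = sum(predict_nb(model, words) == y for words, y in test)
        curve.append((n, correct / len(test)))
    return curve


rng = random.Random(0)  # fixed seed so the run is repeatable
train = make_toy_tweets(2000, rng)
test = make_toy_tweets(500, rng)
sizes = [10, 50, 200, 1000, 2000]  # hypothetical training set sizes
for n, acc in learning_curve(train, test, sizes):
    print(f"train size {n:5d}  accuracy {acc:.3f}")
```

Using nested subsets and a single held-out test set keeps the comparison across sizes consistent; a fuller study would repeat the sampling several times and swap in the models and text representations under comparison.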


Cited By

Effrosynidis D, Sylaios G, Arampatzis A (2024). The Effect of Training Data Size on Disaster Classification from Twitter. Information 15:7 (393). Online publication date: 8-Jul-2024.
https://doi.org/10.3390/info15070393
Samosir F, Riyaldi S (2024). Sentiment Analysis of TikTok Comments on Indonesian Presidential Elections Using IndoBERT. 2024 3rd International Conference on Creative Communication and Innovative Technology (ICCIT) (1-7). Online publication date: 7-Aug-2024.
https://doi.org/10.1109/ICCIT62134.2024.10701256


Published In

WI-IAT '21: IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology

December 2021

698 pages

ISBN:9781450391153

DOI:10.1145/3486622

Copyright © 2021 ACM.


Sponsors

SIGAI: ACM Special Interest Group on Artificial Intelligence

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Funding Sources

DFG Grant Managed Forgetting
European Union's Horizon 2020 research and innovation program - ROXANNE
European Union's Horizon 2020 research and innovation program - MIRROR

Conference

WI-IAT '21

Sponsor: SIGAI

WI-IAT '21: IEEE/WIC/ACM International Conference on Web Intelligence

December 14 - 17, 2021

Melbourne, VIC, Australia


