DOI: 10.1145/2851613.2851986

Improving classification of tweets using word-word co-occurrence information from a large external corpus

Published: 04 April 2016

Abstract

Classifying tweets is an intrinsically hard task: tweets are short messages, which makes traditional bag-of-words approaches inefficient. In fact, bag-of-words approaches ignore relationships between important terms that do not co-occur literally.
In this paper we resort to word-word co-occurrence information from a large corpus to expand the vocabulary of another corpus consisting of tweets. Our results show that we are able to reduce the number of erroneous classifications by 14% using co-occurrence information.
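
The abstract does not spell out how the expansion is computed, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a hypothetical table `COOC` of co-occurrence counts taken from an external corpus (the values are made up) and a hypothetical helper `expand` that adds the strongest co-occurring terms to a tweet's bag of words with a reduced weight; `top_k` and `weight` are illustrative parameters.

```python
from collections import Counter

# Hypothetical co-occurrence counts from a large external corpus:
# COOC[w][v] = how often v was seen near w. Values are invented.
COOC = {
    "flu": {"fever": 8, "virus": 6, "cough": 4},
    "sick": {"fever": 5, "ill": 9},
}

def expand(tokens, cooc, top_k=2, weight=0.5):
    """Return bag-of-words counts augmented with co-occurring terms."""
    bag = Counter(tokens)
    expanded = Counter(bag)
    for word, count in bag.items():
        neighbours = cooc.get(word, {})
        total = sum(neighbours.values())
        if total == 0:
            continue  # word unknown in the external corpus
        # Keep only the top_k most frequent neighbours (ties broken by name)
        best = sorted(neighbours.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
        for other, n in best:
            # Added terms get a fraction of the original term's count
            expanded[other] += weight * count * n / total
    return dict(expanded)

features = expand(["got", "the", "flu"], COOC)
```

With this sketch, a tweet that mentions only "flu" also receives small weights for "fever" and "virus", so a classifier can relate it to tweets that use those words without sharing any literal term.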

Cited By

  • (2017) Improving Classification of Tweets Using Linguistic Information from a Large External Corpus. Industrial Networks and Intelligent Systems, pp. 122-134. DOI: 10.1007/978-3-319-52569-3_11. Online publication date: 19-Jan-2017

Published In

SAC '16: Proceedings of the 31st Annual ACM Symposium on Applied Computing
April 2016
2360 pages
ISBN:9781450337397
DOI:10.1145/2851613

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. classification
  2. lasso regression
  3. twitter
  4. word-word co-occurrence

Qualifiers

  • Research-article

Conference

SAC 2016: Symposium on Applied Computing
April 4 - 8, 2016
Pisa, Italy

Acceptance Rates

SAC '16 Paper Acceptance Rate 252 of 1,047 submissions, 24%;
Overall Acceptance Rate 1,650 of 6,669 submissions, 25%
