research-article

Improving classification of tweets using word-word co-occurrence information from a large external corpus

Authors:

Hugo Lewi Hammer,

Aleksander Bai,

Paal Engelstad

SAC '16: Proceedings of the 31st Annual ACM Symposium on Applied Computing

Pages 1174 - 1177

https://doi.org/10.1145/2851613.2851986

Published: 04 April 2016

Abstract

Classifying tweets is an intrinsically hard task because tweets are short messages, which makes traditional bag-of-words approaches inefficient. In fact, bag-of-words approaches ignore relationships between important terms that do not co-occur literally.

In this paper we use word-word co-occurrence information from a large corpus to expand the vocabulary of another corpus consisting of tweets. Our results show that co-occurrence information reduces the number of erroneous classifications by 14%.
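The abstract does not spell out the expansion procedure, so the following is only a minimal sketch of the general idea: count which words appear near each other in an external corpus, then append the most frequent neighbours of each tweet word to the tweet before classification. The function names `build_cooccurrence` and `expand_tweet`, the window size, and the `top_k` cutoff are assumptions made for illustration, not the authors' method.

```python
from collections import Counter, defaultdict


def build_cooccurrence(corpus, window=2):
    """Count word pairs that appear within `window` tokens of each other.

    `corpus` is an iterable of plain-text sentences from the external corpus.
    Whitespace tokenization is an assumption made to keep the sketch short.
    """
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][tokens[j]] += 1
    return counts


def expand_tweet(tweet, counts, top_k=1):
    """Append the `top_k` most frequent co-occurring words of each token.

    Words that are unknown to the external corpus are left unexpanded.
    """
    tokens = tweet.lower().split()
    expanded = list(tokens)
    for word in tokens:
        neighbours = counts.get(word)
        if neighbours is None:
            continue
        for neighbour, _ in neighbours.most_common(top_k):
            if neighbour not in expanded:
                expanded.append(neighbour)
    return expanded


if __name__ == "__main__":
    # Toy stand-in for the large external corpus (hypothetical data).
    external = [
        "the flu virus spreads fast",
        "flu virus symptoms include fever",
    ]
    counts = build_cooccurrence(external, window=2)
    print(expand_tweet("got the flu", counts, top_k=1))
```

The expanded token list can then be fed to any bag-of-words classifier, so a tweet mentioning only "flu" shares a feature with one mentioning only "virus" even though the two words never co-occur literally in the tweets themselves.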


Cited By

Hammer H, Yazidi A, Bai A, Engelstad P (2017). Improving Classification of Tweets Using Linguistic Information from a Large External Corpus. Industrial Networks and Intelligent Systems, pages 122-134. Online publication date: 19-Jan-2017.
https://doi.org/10.1007/978-3-319-52569-3_11


Published In


SAC '16: Proceedings of the 31st Annual ACM Symposium on Applied Computing

April 2016

2360 pages

ISBN:9781450337397

DOI:10.1145/2851613

Conference Chair:
Sascha Ossowski
University Rey Juan Carlos, Spain

Copyright © 2016 ACM.


Sponsors

SIGAPP: ACM Special Interest Group on Applied Computing

Publisher

Association for Computing Machinery

New York, NY, United States


Conference

SAC 2016

Sponsor:

SIGAPP

SAC 2016: Symposium on Applied Computing

April 4 - 8, 2016

Pisa, Italy

Acceptance Rates

SAC '16 Paper Acceptance Rate 252 of 1,047 submissions, 24%;

Overall Acceptance Rate 1,650 of 6,669 submissions, 25%
