DOI: 10.1145/2740908.2743039

ResToRinG CaPitaLiZaTion in #TweeTs

Published: 18 May 2015

Abstract

The rapid proliferation of microblogs such as Twitter has resulted in a vast quantity of written text becoming available that contains interesting information for NLP tasks. However, the noise level in tweets is so high that standard NLP tools perform poorly. In this paper, we present a statistical truecaser for tweets using a 3-gram language model built with truecased newswire texts and tweets. Our truecasing method shows an improvement in named entity recognition and part-of-speech tagging tasks.
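To make the idea of language-model truecasing concrete, here is a minimal Python sketch. It is not the authors' implementation: the class name `TrigramTruecaser`, the toy training sentences, and the use of "stupid backoff" scoring with a constant of 0.4 are all assumptions chosen for brevity. The sketch counts case-sensitive 1- to 3-grams in truecased text, then, for each input token, picks among the surface forms seen in training with a Viterbi search over trigram scores.

```python
import math
from collections import Counter, defaultdict


class TrigramTruecaser:
    """Toy truecaser (hypothetical): case-sensitive n-gram counts + Viterbi."""

    def __init__(self, backoff=0.4):
        self.backoff = backoff          # assumed stupid-backoff constant
        self.counts = Counter()         # n-gram tuple -> count, n = 1..3
        self.ctx = Counter()            # history tuple -> count
        self.forms = defaultdict(set)   # lowercased token -> observed casings

    def train(self, sentences):
        for sent in sentences:
            padded = ["<s>", "<s>"] + list(sent)
            for i in range(2, len(padded)):
                self.forms[padded[i].lower()].add(padded[i])
                for n in (1, 2, 3):
                    gram = tuple(padded[i - n + 1 : i + 1])
                    self.counts[gram] += 1
                    self.ctx[gram[:-1]] += 1

    def _logp(self, history, word):
        # Try the trigram, then the bigram, then the unigram,
        # paying a fixed penalty for each backoff step.
        penalty = 0.0
        for n in (3, 2, 1):
            ctx = history[3 - n :]      # last n-1 words of the 2-word history
            gram = ctx + (word,)
            if self.counts[gram]:
                return penalty + math.log(self.counts[gram] / self.ctx[ctx])
            penalty += math.log(self.backoff)
        # Word never seen in training: small floor score.
        return penalty + math.log(1.0 / (self.ctx[()] + 1))

    def truecase(self, tokens):
        # Viterbi over states (previous word, current word).
        best = {("<s>", "<s>"): (0.0, [])}
        for tok in tokens:
            # Unknown tokens keep the casing they arrived with.
            candidates = self.forms.get(tok.lower(), {tok})
            nxt = {}
            for hist, (score, path) in best.items():
                for cand in candidates:
                    s = score + self._logp(hist, cand)
                    key = (hist[1], cand)
                    if key not in nxt or s > nxt[key][0]:
                        nxt[key] = (s, path + [cand])
            best = nxt
        return max(best.values(), key=lambda item: item[0])[1]


# Toy truecased training data (illustrative only).
truecaser = TrigramTruecaser()
truecaser.train([
    ["I", "live", "in", "New", "York"],
    ["she", "bought", "a", "new", "car"],
    ["New", "York", "is", "big"],
    ["a", "new", "day"],
])
print(truecaser.truecase("i live in new york".split()))
# -> ['I', 'live', 'in', 'New', 'York']
```

The same token "new" is resolved differently depending on context: after "live in" the counts favour "New", while after "a" they favour "new". A production system would more likely train a smoothed model with an established language-modeling toolkit on far larger corpora.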

Published In

WWW '15 Companion: Proceedings of the 24th International Conference on World Wide Web
May 2015
1602 pages
ISBN: 9781450334730
DOI: 10.1145/2740908

Sponsors

  • IW3C2: International World Wide Web Conference Committee

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. named entity recognition on social media
  2. part-of-speech tagging on social media
  3. truecaser

Qualifiers

  • Research-article

Funding Sources

  • Swiss National Science Foundation

Conference

WWW '15
Sponsor:
  • IW3C2

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

Bibliometrics

Article Metrics

  • Downloads (last 12 months): 6
  • Downloads (last 6 weeks): 1
Reflects downloads up to 22 Sep 2024

Cited By

  • (2022) Capitalization Normalization for Language Modeling with an Accurate and Efficient Hierarchical RNN Model. ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6097-6101. DOI: 10.1109/ICASSP43922.2022.9746492. Online publication date: 23-May-2022
  • (2022) Capitalization Feature and Learning Rate for Improving NER Based on RNN BiLSTM-CRF. 2022 IEEE International Conference on Cybernetics and Computational Intelligence (CyberneticsCom), pages 398-403. DOI: 10.1109/CyberneticsCom55287.2022.9865660. Online publication date: 16-Jun-2022
  • (2022) Dissonant Discourses: Evelyn Scott and Cyril Kay-Scott’s Experiences in Brazil (1914–1919). Life Writing, 20:3, pages 491-508. DOI: 10.1080/14484528.2022.2116762. Online publication date: 13-Sep-2022
  • (2022) Towards improving the robustness of sequential labeling models against typographical adversarial examples using triplet loss. Natural Language Engineering, pages 1-29. DOI: 10.1017/S1351324921000486. Online publication date: 4-Feb-2022
  • (2021) Leveraging various Transformers Based Architectures for Truecasing. Procedia Computer Science, 193, pages 432-441. DOI: 10.1016/j.procs.2021.10.045. Online publication date: 2021
  • (2021) Capitalization and punctuation restoration: a survey. Artificial Intelligence Review, 55:3, pages 1681-1722. DOI: 10.1007/s10462-021-10051-x. Online publication date: 23-Jul-2021
  • (2020) An Efficient Architecture for Predicting the Case of Characters using Sequence Models. 2020 IEEE 14th International Conference on Semantic Computing (ICSC), pages 174-177. DOI: 10.1109/ICSC.2020.00035. Online publication date: Feb-2020
  • (2017) The Complementary Nature of Different NLP Toolkits for Named Entity Recognition in Social Media. Progress in Artificial Intelligence, pages 803-814. DOI: 10.1007/978-3-319-65340-2_65. Online publication date: 9-Aug-2017
