research-article

ResToRinG CaPitaLiZaTion in #TweeTs

Authors:

Kalina Bontcheva,

Genevieve Gorrell

WWW '15 Companion: Proceedings of the 24th International Conference on World Wide Web

Pages 1111 - 1115

https://doi.org/10.1145/2740908.2743039

Published: 18 May 2015

Abstract

The rapid proliferation of microblogs such as Twitter has resulted in a vast quantity of written text becoming available that contains interesting information for NLP tasks. However, the noise level in tweets is so high that standard NLP tools perform poorly. In this paper, we present a statistical truecaser for tweets using a 3-gram language model built with truecased newswire texts and tweets. Our truecasing method shows an improvement in named entity recognition and part-of-speech tagging tasks.
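To make the idea of a 3-gram truecaser concrete, the sketch below shows one way such a model could work. It is not the authors' implementation, which this page does not describe in detail. It assumes a tokenized, truecased training corpus, scores each candidate casing of a token with a trigram model using a simple "stupid backoff" fallback to bigrams and unigrams, and decodes greedily from left to right. The function names (`train`, `score`, `truecase`), the backoff weight of 0.4, and the toy corpus are all illustrative assumptions.

```python
from collections import Counter, defaultdict

BOS = "<s>"  # sentence-start padding symbol (an assumption of this sketch)


def train(sentences):
    """Count 1-, 2- and 3-grams over truecased, tokenized sentences."""
    counts = Counter()
    variants = defaultdict(set)  # lowercased form -> cased forms seen in training
    total = 0
    for tokens in sentences:
        padded = [BOS, BOS] + list(tokens)
        # Context counts for the padding symbols themselves.
        counts[(BOS,)] += 1
        counts[(BOS, BOS)] += 1
        for i in range(2, len(padded)):
            word = padded[i]
            variants[word.lower()].add(word)
            counts[(word,)] += 1
            counts[(padded[i - 1], word)] += 1
            counts[(padded[i - 2], padded[i - 1], word)] += 1
            total += 1
    return counts, variants, total


def score(counts, total, u, v, w, alpha=0.4):
    """Stupid-backoff score of w given the two preceding tokens u, v."""
    if counts[(u, v, w)]:
        return counts[(u, v, w)] / counts[(u, v)]
    if counts[(v, w)]:
        return alpha * counts[(v, w)] / counts[(v,)]
    return alpha * alpha * counts[(w,)] / total


def truecase(tokens, counts, variants, total):
    """Greedily pick, for each token, the best-scoring casing seen in training."""
    out = []
    u, v = BOS, BOS
    for tok in tokens:
        # Unknown tokens (e.g. unseen hashtags) are left unchanged.
        candidates = variants.get(tok.lower(), {tok})
        best = max(sorted(candidates), key=lambda w: score(counts, total, u, v, w))
        out.append(best)
        u, v = v, best
    return out


# Toy truecased corpus, purely illustrative.
corpus = [
    "Apple released a new phone".split(),
    "I ate an apple today".split(),
    "She works at Apple".split(),
    "an apple a day".split(),
]
counts, variants, total = train(corpus)

print(" ".join(truecase("i ate an APPLE today".split(), counts, variants, total)))
print(" ".join(truecase("she works at apple".split(), counts, variants, total)))
```

On this toy corpus the preceding context decides between "apple" and "Apple". A fuller system would typically replace the greedy choice with Viterbi decoding and use a properly smoothed language model trained on much larger newswire and tweet collections.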

Index Terms

ResToRinG CaPitaLiZaTion in #TweeTs
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Language resources

Information

Published In

WWW '15 Companion: Proceedings of the 24th International Conference on World Wide Web

May 2015

1602 pages

ISBN:9781450334730

DOI:10.1145/2740908

General Chairs:
Aldo Gangemi, National Research Council, Italy & Paris 13 University-CNRS, France
Stefano Leonardi, Sapienza University of Rome, Italy
Alessandro Panconesi, Sapienza University of Rome, Italy

Copyright © 2015 Copyright is held by the International World Wide Web Conference Committee (IW3C2).

Sponsors

IW3C2: International World Wide Web Conference Committee

In-Cooperation

SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

Swiss National Science Foundation

Conference

WWW '15

Sponsor:

IW3C2

WWW '15: 24th International World Wide Web Conference

May 18 - 22, 2015

Florence, Italy

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

Bibliometrics

Total Citations: 8
Total Downloads: 153
Downloads (Last 12 months): 6
Downloads (Last 6 weeks): 1

Reflects downloads up to 22 Sep 2024

Cited By

Zhang H, Cheng Y, Kumar S, Huang W, Chen M, Mathews R (2022). Capitalization Normalization for Language Modeling with an Accurate and Efficient Hierarchical RNN Model. ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6097-6101. Online publication date: 23-May-2022.
https://doi.org/10.1109/ICASSP43922.2022.9746492

Warto, Muljono, Purwanto, Noersasongko E (2022). Capitalization Feature and Learning Rate for Improving NER Based on RNN BiLSTM-CRF. 2022 IEEE International Conference on Cybernetics and Computational Intelligence (CyberneticsCom), pages 398-403. Online publication date: 16-Jun-2022.
https://doi.org/10.1109/CyberneticsCom55287.2022.9865660

das Graças Salgado M (2022). Dissonant Discourses: Evelyn Scott and Cyril Kay-Scott’s Experiences in Brazil (1914–1919). Life Writing 20:3, pages 491-508. Online publication date: 13-Sep-2022.
https://doi.org/10.1080/14484528.2022.2116762

Udomcharoenchaikit C, Boonkwan P, Vateekul P (2022). Towards improving the robustness of sequential labeling models against typographical adversarial examples using triplet loss. Natural Language Engineering, pages 1-29. Online publication date: 4-Feb-2022.
https://doi.org/10.1017/S1351324921000486

Singhal S, Modi N, Dandekar V, Mane S (2021). Leveraging various Transformers Based Architectures for Truecasing. Procedia Computer Science 193, pages 432-441. Online publication date: 2021.
https://doi.org/10.1016/j.procs.2021.10.045

Păiş V, Tufiş D (2021). Capitalization and punctuation restoration: a survey. Artificial Intelligence Review 55:3, pages 1681-1722. Online publication date: 23-Jul-2021.
https://doi.org/10.1007/s10462-021-10051-x

Ramena G, Nagaraju D, Moharana S, Prasanna Mohanty D, Purre N (2020). An Efficient Architecture for Predicting the Case of Characters using Sequence Models. 2020 IEEE 14th International Conference on Semantic Computing (ICSC), pages 174-177. Online publication date: Feb-2020.
https://doi.org/10.1109/ICSC.2020.00035

Batista F, Figueira Á (2017). The Complementary Nature of Different NLP Toolkits for Named Entity Recognition in Social Media. Progress in Artificial Intelligence, pages 803-814. Online publication date: 9-Aug-2017.
https://doi.org/10.1007/978-3-319-65340-2_65
