Spanish Treebank Annotation of Informal Non-standard Web Text

Taulé, Mariona; Martí, M. Antonia; Bies, Ann; Nofre, Montserrat; Garí, Aina; Song, Zhiyi; Strassel, Stephanie; Ellis, Joe

doi:10.1007/978-3-319-24800-4_2

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9396)

Included in the following conference series:

International Conference on Web Engineering

Abstract

This paper presents the Latin American Spanish Discussion Forum Treebank (LAS-DisFo). The corpus consists of 50,291 words and 2,846 sentences that are part-of-speech tagged, lemmatized, and syntactically annotated with constituents and functions. We describe how the corpus was built, the annotation methodology and scheme, and the criteria applied to the most problematic phenomena commonly encountered in this kind of informal, unedited web text. LAS-DisFo is the first available Latin American Spanish corpus of non-standard language that has been morphologically and syntactically annotated. It is a valuable linguistic resource for training and evaluating parsers and PoS taggers.

This material is based on research sponsored by Air Force Research Laboratory and Defense Advanced Research Projects Agency under agreement number FA8750-13-2-0045. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of Air Force Research Laboratory and Defense Advanced Research Projects Agency or the U.S. Government.

Author information

Authors and Affiliations

Linguistic Data Consortium, University of Pennsylvania, 3600 Market Street, Suite 801, Philadelphia, PA, 19104, USA
Ann Bies, Zhiyi Song, Stephanie Strassel & Joe Ellis
CLiC, University of Barcelona, Gran Via 588, 08007, Barcelona, Spain
Mariona Taulé, M. Antonia Martí, Montserrat Nofre & Aina Garí

Corresponding author

Correspondence to Mariona Taulé.

Editor information

Editors and Affiliations

Università di Trento, Povo, Trento, Italy
Florian Daniel
Universidad del Pais Vasco, San Sebastian, Spain
Oscar Diaz

About this paper

Cite this paper

Taulé, M. et al. (2015). Spanish Treebank Annotation of Informal Non-standard Web Text. In: Daniel, F., Diaz, O. (eds) Current Trends in Web Engineering. ICWE 2015. Lecture Notes in Computer Science, vol 9396. Springer, Cham. https://doi.org/10.1007/978-3-319-24800-4_2

DOI: https://doi.org/10.1007/978-3-319-24800-4_2
Published: 22 September 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-24799-1
Online ISBN: 978-3-319-24800-4
eBook Packages: Computer Science, Computer Science (R0)
