Abstract
Natural language processing tools are mostly developed for and optimized on newspaper texts, and often show a substantial performance drop when applied to other types of texts such as Twitter feeds, chat data or Internet forum posts. We explore a range of easy-to-implement methods of adapting existing part-of-speech taggers to improve their performance on Internet texts. Our results show that these methods can improve tagger performance substantially.
Notes
“Analyse und Instrumentarien zur Beobachtung des Schreibgebrauchs im Deutschen”, http://www.schreibgebrauch.de
Text Encoding Initiative, http://www.tei-c.org/
It could also be that the writer has a Swiss-German background, where “heiss” is the correct spelling.
Acknowledgement
This work is part of the BMBF-funded project “Analyse und Instrumentarien zur Beobachtung des Schreibgebrauchs im Deutschen.” We thank our student assistants Jana Ott, Ali Abbas, Jakob Prange and Maximilian Wolf for their support in the annotation and evaluation of our data sets.
Cite this article
Horbach, A., Thater, S., Steffen, D. et al. Internet Corpora: A Challenge for Linguistic Processing. Datenbank Spektrum 15, 41–47 (2015). https://doi.org/10.1007/s13222-014-0172-z
DOI: https://doi.org/10.1007/s13222-014-0172-z