Abstract
For a number of languages, web crawling allows researchers to collect huge text samples for building corpora. However, only part of the material found on the internet is useful for Natural Language Processing: parsers, for example, typically cannot handle lists and tables, or very short or very long sentences. Methods exist (cf. e.g. [3]) for cleaning the downloaded data before adding it to a corpus collection, but even when these are applied, some of the remaining textual material may still be unsuitable for certain research requirements. This paper describes the methods used to prepare deWaC, a freely available German web corpus of the WaCky project, for automatic processing up to the parsing level. It then discusses ways in which the resulting corpus, called SdeWaC, has been used since its release.
References
Baroni, M., Kilgarriff, A.: Large linguistically-processed web corpora for multiple languages. In: Conference Companion of EACL 2006, 11th Conference of the European Chapter of the Association for Computational Linguistics, pp. 87–90 (2006)
Baroni, M., Bernardini, S., Ferraresi, A., Zanchetta, E.: The WaCky Wide Web: A Collection of Very Large Linguistically Processed Web-Crawled Corpora. Language Resources and Evaluation 43(3), 209–226 (2009)
Bauer, D., Degen, J., Deng, X., Herger, P., Gasthaus, J., Giesbrecht, E., Jansen, L., Kalina, C., Krüger, T., Märtin, R., Schmidt, M., Scholler, S., Steger, J., Stemle, E., Evert, S.: Fiasco: Filtering the internet by automatic subtree classification. In: Fairon, C., Naets, H., Kilgarriff, A., de Schryver, G.-M. (eds.) Building and Exploring Web Corpora: Proceedings of the 3rd Web as Corpus Workshop (WAC3), Incorporating CLEANEVAL, Louvain-la-Neuve, Belgium, pp. 111–121 (2007)
Bohnet, B.: Top accuracy and fast dependency parsing is not a contradiction. In: Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), Coling 2010 Organizing Committee, Beijing, China, pp. 89–97 (2010)
Briscoe, T., Carroll, J.: Automatic extraction of subcategorization from corpora. In: Proceedings of the Fifth Conference on Applied Natural Language Processing, Washington DC, USA, pp. 356–363 (1997)
Buchholz, S., Marsi, E.: CoNLL-X Shared Task on Multilingual Dependency Parsing. In: Proceedings of the Tenth Conference on Computational Natural Language Learning (CoNLL-X), pp. 149–164. Association for Computational Linguistics, New York City (2006)
Eberle, K., Faaß, G., Heid, U.: Proposition oder Temporalangabe? Disambiguierung von -ung-Nominalisierungen von verba dicendi in nach-PPs. In: Chiarcos, C., Eckart de Castilho, R., Stede, M. (eds.) Proceedings of the Biennial GSCL Conference 2009, Von der Form zur Bedeutung: Texte automatisch verarbeiten / From Form to Meaning: Processing Texts Automatically, Potsdam, pp. 81–91. Narr, Tübingen (2009)
Faaß, G., Heid, U., Schmid, H.: Design and application of a Gold Standard for morphological analysis: SMOR in validation. In: Proceedings of the Seventh LREC Conference, European Language Resources Association (ELRA), Valletta, Malta, pp. 803–810 (2010)
Haselbach, B., Eckart, K., Seeker, W., Eberle, K., Heid, U.: Approximating Theoretical Linguistics Classification in Real Data: the Case of German “nach” Particle Verbs. In: Proceedings of COLING 2012, Mumbai, India, pp. 1113–1128 (2012)
Landis, J.R., Koch, G.G.: The measurement of observer agreement for categorical data. Biometrics 33(1), 159–174 (1977)
Quasthoff, U., Richter, M., Biemann, C.: Corpus portal for search in monolingual corpora. In: Proceedings of the LREC 2006, Genoa, Italy, pp. 1799–1802 (2006)
Schiehlen, M.: A Cascaded Finite-State Parser for German. In: Proceedings of the Research Note Sessions of the 10th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2003), Budapest, pp. 163–166 (2003)
Schiller, A., Teufel, S., Thielen, C.: Guidelines für das Tagging deutscher Textcorpora mit STTS. Universität Stuttgart and Universität Tübingen (1995)
Schmid, H.: Probabilistic Part-of-Speech Tagging Using Decision Trees. In: International Conference on New Methods in Language Processing, Manchester, UK, pp. 44–49 (1994)
Schmid, H.: Unsupervised Learning of Period Disambiguation for Tokenisation. Internal Report, IMS. University of Stuttgart (2000)
Schmid, H., Fitschen, A., Heid, U.: SMOR: A German computational morphology covering derivation, composition, and inflection. In: Proceedings of LREC 2004, Lisboa, Portugal (2004)
Schulte im Walde, S.: Webkorpora für die automatische Akquisition lexikalisch-semantischen Wissens. In: Workshop Webkorpora in Computerlinguistik und Sprachforschung. Institut für Deutsche Sprache, Mannheim (2012)
Springorum, S., Schulte im Walde, S., Roßdeutscher, A.: Automatic Classification of German an Particle Verbs. In: Proceedings of the 8th International Conference on Language Resources and Evaluation. Istanbul, Turkey (2012)
Stus, O.: Web-Korpus, Korpusaufbereitung der deutschen Web-Korpora. Internal Report, IMS, Universität Stuttgart (2008)
Weller, M., Heid, U.: Extraction of German multiword expressions from parsed corpora using context features. In: Calzolari, N., Choukri, K., Maegaard, B., Mariani, J., Odijk, J., Piperidis, S., Rosner, M., Tapias, D. (eds.) Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC 2010), pp. 3195–3201. European Language Resources Association (ELRA), Valletta (2010)
Zarrieß, S., Schäfer, F., Schulte im Walde, S.: Passives of reflexives: a corpus study. Linguistic Evidence - Berlin Special. Berlin, Germany (2013)
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Faaß, G., Eckart, K. (2013). SdeWaC – A Corpus of Parsable Sentences from the Web. In: Gurevych, I., Biemann, C., Zesch, T. (eds) Language Processing and Knowledge in the Web. Lecture Notes in Computer Science, vol 8105. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40722-2_6
DOI: https://doi.org/10.1007/978-3-642-40722-2_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-40721-5
Online ISBN: 978-3-642-40722-2
eBook Packages: Computer Science (R0)