Abstract
The paper describes a corpus of texts produced by non-native speakers of Czech. We discuss its annotation scheme, consisting of three interlinked tiers, designed to handle a wide range of error types present in the input. Each tier corrects different types of errors; links between the tiers allow capturing errors in word order and complex discontinuous expressions. Errors are not only corrected, but also classified. The annotation scheme is tested on a data set including approx. 175,000 words with fair inter-annotator agreement results. We also explore the possibility of applying automated linguistic annotation tools (taggers, spell checkers and grammar checkers) to the learner text to support or even substitute manual annotation.
Notes
Interlanguage is subject to constant changes as the learner progresses through successive stages of acquiring more competence, and can be seen as an individual and dynamic continuum between one’s native and target languages. See Selinker (1972).
For some members of the Czech Roma community it might be difficult to identify their first language, yet such students often exhibit a number of traits typical of the process of acquisition of Czech as a second language. Bedřichová et al. (2011) assume that the social, cultural and linguistic differences between the non-Roma majority and some Roma communities may imply specific language development of Roma children.
However, some authors intentionally avoid categorizing errors. They see categorization as an interpretation model, influencing access to the data. Instead, they use emendation as an implicit explanation for the errors (Fitzpatrick and Seegmiller 2004).
We are aware of four other Slavic L2 corpora. However, they are either small (the first one) or under development (the other three).
- PiKUST (Stritar 2009), a 35KW corpus of learner Slovene, with error annotation adopted from the Norwegian ASK project.
- piRULEC (Kisselev 2013), a corpus of learner Russian, currently being built at Portland State University; a collection of academic writings of advanced foreign and heritage learners of Russian.
- A 10KW corpus collected from advanced American learners of Russian (Pavlenko and Hasko 2007).
- A corpus of theses written in several Slavic languages by non-native students of the University of Helsinki.
A 7MW ‘didactical/educational’ part of the Russian National Corpus is sometimes referred to as a learner corpus, but in fact it includes works of fiction on a list of recommended readings in Russian schools (see http://www.ruscorpora.ru/en/corpora-structure.html).
The error taxonomy is hierarchical—error types are partitioned into domains, which are further divided into more specific subcategories, tagged manually or automatically. For example, the domain of complex verb form errors on T2 can be further specified as errors in analytical verb forms (cvf), modal verbs (mod), verbo-nominal predicates, passive or resultative form (vnp).
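The two-level structure described in this note can be pictured as a small lookup table. The following is only an illustrative sketch, not the project's actual tag set or tooling: the class `ErrorDomain` and the helper `domain_of` are hypothetical names, and the only data included are the tier (T2), the domain and the three subcategory tags (cvf, mod, vnp) named above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ErrorDomain:
    """One domain of the taxonomy with its more specific subcategories.

    Hypothetical container used for illustration only.
    """
    name: str
    tier: str
    subtypes: Dict[str, str] = field(default_factory=dict)


# Only the entries mentioned in the note; the real taxonomy is larger.
COMPLEX_VERB_FORM = ErrorDomain(
    name="complex verb form errors",
    tier="T2",
    subtypes={
        "cvf": "error in an analytical verb form",
        "mod": "error in a modal verb",
        "vnp": "error in a verbo-nominal predicate, passive or resultative form",
    },
)


def domain_of(tag: str, domains: List[ErrorDomain]) -> Optional[ErrorDomain]:
    """Return the domain that contains the given subcategory tag, if any."""
    for domain in domains:
        if tag in domain.subtypes:
            return domain
    return None


if __name__ == "__main__":
    found = domain_of("mod", [COMPLEX_VERB_FORM])
    if found is not None:
        print(found.tier, found.name, "->", found.subtypes["mod"])
```

Keeping the subcategory tags under their domain means a query can be posed at either level: for a whole domain, or for one specific tag within it.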
For the share of different learner groups according to L1 see Table 2.
See also Jelínek et al. (2012).
In Czech phonology, h and ch [x] act as voicing counterparts.
Flor and Futagi (2011) report similar results for ConSpel, a tool used to detect and correct non-word misspellings in English, using n-gram statistics based on the Google Web1T database.
After registration at http://www.korpus.cz/english/dohody.php the result is available for on-line searches as czesl-plain, one of the synchronous specialized subcorpora of the Czech National Corpus. See http://www.korpus.cz/english/czesl-plain.php for a description and http://www.korpus.cz/corpora/ for the search interface.
The size of the sample is smaller than in the previous comparison at T0 only due to a more demanding procedure to obtain the data at T1.
Morče was used to tag T1 because it is currently the best tagger of Czech and we were only interested in the cross-tagger comparison on the ill-formed input at T0.
References
Abuhakema, G., Feldman, A., & Fitzpatrick, E. (2009). ARIDA: An Arabic interlanguage database and its applications: A pilot study. Journal of the National Council of Less Commonly Taught Languages (NCOLCTL) 7, 161–184.
Bedřichová, Z., Šebesta, K., Škodová, S., & Šormová, K. (2011). Podoba a využití korpusu jinojazyčných a romských mluvčích češtiny: CZESL a ROMi [Form and utilization of a corpus of non-native and Romany speakers of Czech: CZESL and ROMi]. In F. Čermák (Ed.), Korpusová lingvistika Praha 2011: 2 - Výzkum a výstavba korpusů, Ústav Českého národního korpusu, Nakladatelství Lidové noviny, Praha, Studie z korpusové lingvistiky, vol 15 (pp. 93–104).
Brants, T. (2000). TnT—A statistical part-of-speech tagger. In Proceedings of the sixth applied natural language processing (ANLP-2000). Seattle, WA.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1), 37–46.
de Cock, S. (2003). Recurrent sequences of words in native speaker and advanced learner spoken and written English. PhD thesis, Université catholique de Louvain, Louvain-la-Neuve.
de Haan, P. (2000). Tagging non-native English with the TOSCA-ICLE tagger. In C. Mair & M. Hundt (Eds.), Corpus linguistics and linguistic theory. Papers from the twentieth international conference on English language research on computerized corpora (ICAME 20), Freiburg im Breisgau 1999 (pp. 69–80). Amsterdam: Rodopi.
de Mönnink, I. (2000). Parsing a learner corpus? In C. Mair & M. Hundt (Eds.), Corpus linguistics and linguistic theory. Papers from the twentieth international conference on English language research on computerized corpora (ICAME 20), Freiburg im Breisgau 1999 (pp. 81–90). Amsterdam: Rodopi.
Dickinson, M. (2010). Generating learner-like morphological errors in Russian. In Proceedings of the 23rd international conference on computational linguistics (COLING-10). Beijing. http://jones.ling.indiana.edu/~mdickinson/papers/dickinson-coling10.html.
Díaz-Negrillo, A., & Fernández-Domínguez, J. (2006). Error tagging systems for learner corpora. RESLA, 19, 83–102.
Díaz-Negrillo, A., Meurers, D., Valera, S., & Wunsch, H. (2010). Towards interlanguage POS annotation for effective learner corpora in SLA and FLT. Language Forum, 36(1–2), 139–154. Special issue on Corpus Linguistics for Teaching and Learning, in honour of John Sinclair. http://purl.org/dm/papers/diaz-negrillo-et-al-09.html.
Fitzpatrick, E., & Seegmiller, S. (2001). The Montclair electronic language learner database. In Proceedings of the international conference on computing and information technologies (ICCIT).
Fitzpatrick, E., & Seegmiller, S. (2004). The Montclair electronic language database project. In U. Connor & T. A. Upton (Eds.), Applied corpus linguistics: A multidimensional perspective (pp. 223–238). Amsterdam: Rodopi.
Flor, M., & Futagi, Y. (2011). Automatic correction of non-word misspellings and generation of learner language corpora. In Learner corpus research 2011 – 20 years of learner corpus research: Looking back, moving ahead, Centre for English Corpus Linguistics. Université catholique de Louvain, Louvain-la-Neuve.
Granger, S. (1999). Use of tenses by advanced EFL learners: Evidence from error-tagged computer corpus. In H. Hasselgård & S. Oksefjell (Eds.), Out of corpora – Studies in honour of Stig Johansson. Amsterdam & Atlanta: Rodopi. http://hdl.handle.net/2078.1/76322.
Granger, S. (2003a). Error-tagged learner corpora and CALL: A promising synergy. CALICO Journal, 20(3), 465–480.
Granger, S. (2003b). Error-tagged learner corpora and CALL: A promising synergy. CALICO Journal, 20(3), 465–480.
Granger, S. (2008). Learner corpora. In A. Lüdeling & M. Kytö (Eds.), Corpus linguistics. An international handbook, HSK 29.1, vol. 1 (pp. 259–274). Berlin: Mouton de Gruyter.
Hajič, J. (2004). Disambiguation of rich inflection (computational morphology of Czech). Prague: Charles University Press.
Hana, J., Rosen, A., Škodová, S., & Štindlová, B. (2010). Error-tagged learner corpus of Czech. In Proceedings of the fourth linguistic annotation workshop. Uppsala, Sweden: Association for Computational Linguistics. http://utkl.ff.cuni.cz/~rosen/public/hanaetal_law2010.pdf.
Hana, J., Rosen, A., Štindlová, B., & Jäger, P. (2012). Building a learner corpus. In N. Calzolari, K. Choukri, T. Declerck, M. U. Doğan, B. Maegaard, J. Mariani, J. Odijk & S. Piperidis (Eds.), Proceedings of the eighth international conference on language resources and evaluation (LREC’12). Istanbul, Turkey: European Language Resources Association (ELRA).
Jelínek, T. (2008). Nové značkování v Českém národním korpusu [A new tagging system in the Czech National Corpus]. Naše řeč 91, 13–20.
Jelínek, T., & Petkevič, V. (2011). Systém jazykového značkování korpusů současné psané češtiny [A system of linguistic markup of corpora of contemporary written Czech]. In V. Petkevič & A. Rosen (Eds.), Korpusová lingvistika Praha 2011: 3 – Gramatika a značkování korpusů, Ústav Českého národního korpusu, Nakladatelství Lidové noviny, vol. 16, (pp. 154–170). Praha: Studie z korpusové lingvistiky.
Jelínek, T., Štindlová, B., Rosen, A., & Hana, J. (2012). Combining manual and automatic annotation of a learner corpus. In P. Sojka, A. Horák, I. Kopeček & K. Pala (Eds.), Text, speech and dialogue—Proceedings of the 15th international conference TSD 2012, no. 7499 in Lecture Notes in Computer Science, (pp. 127–134). Springer.
Kisselev, O. (2013). Russian learner corpus of academic writing: Design, development and applications. The American Association for Corpus Linguistics (AACL 2013), January 18–20, 2013, San Diego State University, San Diego, US.
Leech, G. (1998). Preface. In S. Granger (Ed.), Learner English on computer (pp. xiv–xx). London: Addison Wesley Longman.
Leńko-Szymańska, A. (2004). Demonstratives as anaphora markers in advanced learners’ English. In G. Aston, S. Bernardini & D. Stewart (Eds.), Corpora and language learners (pp. 89–107). Amsterdam: John Benjamins.
Lüdeling, A. (2008). Mehrdeutigkeiten und Kategorisierung: Probleme bei der Annotation von Lernerkorpora. In P. Grommes & M. Walter (Eds.), Fortgeschrittene Lernervarietäten (pp. 119–140). Tübingen: Niemeyer.
Meurers, D. (2009). On the automatic analysis of learner language: Introduction to the special issue. CALICO Journal 26(3), 469–473. http://purl.org/dm/papers/meurers-09.html.
Nesselhauf, N. (2005). Collocations in a learner corpus. Amsterdam: John Benjamins.
Pavlenko, A., & Hasko, V. (2007). Russian emotion vocabulary in American learners’ narratives. The Modern Language Journal 91, 213–234.
Pravec, N. A. (2002). Survey of learner corpora. ICAME Journal 26, 81–114.
Richter, M. (2010). Pokročilý korektor češtiny [An advanced spell checker of Czech]. Master’s thesis, Faculty of Mathematics and Physics, Charles University, Prague.
Ringbom, H. (1998). Vocabulary frequencies in advanced learner English: A cross-linguistic approach. In S. Granger (Ed.), Learner English on computer (pp. 41–52). Harlow: Longman.
Rozovskaya, A., & Roth, D. (2010). Annotating ESL errors: Challenges and rewards. In Proceedings of NAACL’10 workshop on innovative use of NLP for building educational applications. University of Illinois at Urbana-Champ. http://cogcomp.cs.illinois.edu/page/publication_view/212.
Selinker, L. (1972). Interlanguage. IRAL 10, 209–231.
Spoustová, D., Hajič, J., Votrubec, J., Krbec, P., & Květoň, P. (2007). The best of two worlds: Cooperation of statistical and rule-based taggers for Czech. In Proceedings of the workshop on Balto-Slavonic natural language processing 2007 (pp. 67–74). Praha, Czechia: Association for Computational Linguistics.
Stritar, M. (2009). Slovene as a foreign language: The pilot learner corpus perspective. Slovenski jezik – Slovene Linguistic Studies 7, 135–152.
Šebesta, K. (2010). Korpusy češtiny a osvojování jazyka [Corpora of Czech and language acquisition]. Studie z aplikované lingvistiky/Studies in Applied Linguistics 1, 11–34.
Štindlová, B. (2011). Evaluace chybové anotace v žákovském korpusu češtiny [Evaluation of error mark-up in a learner corpus of Czech]. PhD thesis, Charles University, Faculty of Arts, Prague.
Štindlová, B., Škodová, S., Hana, J., & Rosen, A. (2012a). CzeSL – An error tagged corpus of Czech as a second language. In P. Pęzik (Ed.), PALC 2011 – Practical applications in language and computers, Łódź 13–15 April 2011. Peter Lang, Łódź Studies in Language.
Štindlová, B., Škodová, S., Hana, J., & Rosen, A. (2012b). A learner corpus of Czech: Current state and future directions. In S. Granger, G. Gilquin & F. Meunier (Eds.), Twenty years of learner corpus research: Looking back, moving ahead. Corpora and language in use—Proceedings 1. Louvain-la-Neuve: Presses Universitaires de Louvain (in print).
Štindlová, B., Škodová, S., Rosen, A., & Hana, J. (2012c). Annotating foreign learners’ Czech. In M. Ziková & M. Dočekal (Eds.), Slavic languages in formal grammar. Proceedings of FDSL 8.5, Brno 2010 (pp. 205–219). Frankfurt am Main: Peter Lang.
Tetreault, J., & Chodorow, M. (2008). Native judgements of non-native usage: Experiments in preposition error detection. In COLING workshop on human judgements in computational linguistics. Manchester.
Van Rooy, B., & Schäfer, L. (2003). An evaluation of three POS taggers for the tagging of the Tswana Learner English Corpus. In D. Archer, P. Rayson, A. Wilson & T. McEnery (Eds.), Proceedings of the corpus linguistics 2003 conference (pp. 835–844). Lancaster: UCREL, Lancaster University.
Votrubec, J. (2006). Morphological tagging based on averaged perceptron. In WDS’06 proceedings of contributed papers (pp. 191–195). Praha, Czechia: Matfyzpress, Charles University.
Waibel, B. (2008). Phrasal verbs. German and Italian learners of English compared. Saarbrücken: VDM.
Xiao, R. (2008). Well-known and influential corpora. In A. Lüdeling & M. Kytö (Eds.), Corpus linguistics. An international handbook, handbooks of linguistics and communication science [HSK] 29.1, vol. 1 (pp. 383–457). Berlin: Mouton de Gruyter.
Acknowledgments
The authors are grateful to Tomáš Jelínek and Svatava Škodová for their essential contributions to this work, and also to other members of the project team, namely Milena Hnátková, Vladimír Petkevič, and Hana Skoumalová. This research was supported by the Education for Competitiveness programme, funded by the European Structural Fund and the Czech government as Project No. CZ.1.07/2.2.00/07.0259. It was also co-funded by the GACR Grant No. P406/10/P328, and by the NAKI programme of the Czech Ministry of Culture, Project No. DF11P01OVV013. The corpus is one of the tasks of the project Innovation of Education in the Field of Czech as a Second Language, a part of the operational programme Education for Competitiveness, funded by the European Structural Funds (ESF) and the Czech government. The annotation tool was also partially funded by Grant No. P406/10/P328 of the Grant Agency of the Czech Republic.
Cite this article
Rosen, A., Hana, J., Štindlová, B. et al. Evaluating and automating the annotation of a learner corpus. Lang Resources & Evaluation 48, 65–92 (2014). https://doi.org/10.1007/s10579-013-9226-3